This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
chore: bump crate-ci/typos from 1.25.0 to 1.26.0 (#174)
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.25.0 to 1.26.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.25.0...v1.26.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
604783f
LuxLib Benchmarks
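Each benchmark entry under this heading carries two timings in nanoseconds followed by a ratio. The page does not label these values, so the sketch below rests on an assumption: the first timing is the current run, the second is the previous baseline, and the ratio is the first divided by the second, rounded to two decimals. The function name `timing_ratio` is hypothetical and is not part of LuxLib or of the benchmark tooling.

```python
def timing_ratio(current_ns: float, previous_ns: float) -> float:
    """Return current/previous rounded to two decimals.

    Assumption: a value above 1 means the current run was slower than
    the previous one, and a value below 1 means it was faster.
    """
    if previous_ns <= 0:
        raise ValueError("previous_ns must be positive")
    return round(current_ns / previous_ns, 2)


# Timings copied from entries in this listing.
samples = [
    (5375, 5125),        # layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
    (5250, 6937.5),      # layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
    (6373937.5, 3229667) # batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
]

for current_ns, previous_ns in samples:
    print(current_ns, previous_ns, timing_ratio(current_ns, previous_ns))
```

Under this reading, the three sampled entries reproduce the ratios listed for them (1.05, 0.76, and 1.97), which is consistent with the assumed column order but does not by itself confirm which run is which.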
| Benchmark suite | Current | Previous | Ratio |
| --- | --- | --- | --- |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5375 ns | 5125 ns | 1.05 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5250 ns | 6937.5 ns | 0.76 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7708.5 ns | 7417 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5416 ns | 6083 ns | 0.89 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 113361 ns | 104885 ns | 1.08 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2795172 ns | 2678307 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 601544 ns | 401685 ns | 1.50 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9729.5 ns | 9917 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9938 ns | 10042 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10167 ns | 10750 ns | 0.95 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 11063 ns | 9729 ns | 1.14 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 544547 ns | 495998 ns | 1.10 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17852957 ns | 18744208 ns | 0.95 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 629346 ns | 680377 ns | 0.92 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1500 ns | 1458 ns | 1.03 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1458 ns | 1541.5 ns | 0.95 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1771 ns | 1750 ns | 1.01 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1583 ns | 3187.5 ns | 0.50 |
| bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 20770 ns | 20316 ns | 1.02 |
| bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1342503 ns | 1305124 ns | 1.03 |
| bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 30997 ns | 31190.5 ns | 0.99 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4104 ns | 4334 ns | 0.95 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4500 ns | 4041 ns | 1.11 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4500 ns | 4083 ns | 1.10 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4333 ns | 4354 ns | 1.00 |
| bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 134970 ns | 134077 ns | 1.01 |
| bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 8677498 ns | 8979794 ns | 0.97 |
| bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 138579 ns | 148416.5 ns | 0.93 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57666.5 ns | 57500 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46875 ns | 46667 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47125 ns | 39917 ns | 1.18 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81458 ns | 83500 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36587 ns | 37564 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 582336 ns | 567840.5 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 69420 ns | 80616 ns | 0.86 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2030375 ns | 2038666 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2088625 ns | 2081166 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2086625 ns | 2084042 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1998562 ns | 1991875 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 217216 ns | 223666 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 8077777 ns | 7677352 ns | 1.05 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 930850 ns | 1187113 ns | 0.78 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 175083 ns | 146541 ns | 1.19 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 147291 ns | 148041.5 ns | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 150021 ns | 151625 ns | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 151750 ns | 176750 ns | 0.86 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166825 ns | 166355.5 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7358467.5 ns | 7478548 ns | 0.98 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 262570 ns | 190117 ns | 1.38 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1115103.5 ns | 1106833.5 ns | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1110771 ns | 1109708 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1113771 ns | 1125750 ns | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1136250 ns | 1112687.5 ns | 1.02 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 639845.5 ns | 654461 ns | 0.98 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33057102 ns | 33783553 ns | 0.98 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 864075 ns | 1021271 ns | 0.85 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3792 ns | 5333 ns | 0.71 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4479 ns | 5125 ns | 0.87 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6583 ns | 5750 ns | 1.14 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6375 ns | 5084 ns | 1.25 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 85209.5 ns | 83746 ns | 1.02 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5875726.5 ns | 5563998.5 ns | 1.06 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 59531 ns | 61491 ns | 0.97 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8417 ns | 8792 ns | 0.96 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8750 ns | 8625 ns | 1.01 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9042 ns | 9250 ns | 0.98 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8958 ns | 8417 ns | 1.06 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 557500.5 ns | 559136 ns | 1.00 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 34838164 ns | 34995936.5 ns | 1.00 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 370833 ns | 392504 ns | 0.94 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17958 ns | 17083 ns | 1.05 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16458 ns | 18000 ns | 0.91 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21125 ns | 18791.5 ns | 1.12 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17292 ns | 17708.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 63776.5 ns | 63135.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 2927491.5 ns | 3027434.5 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 82870 ns | 74881 ns | 1.11 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212625 ns | 218791 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213042 ns | 212063 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 212771 ns | 213375 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212291 ns | 218250 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 329859 ns | 334874 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 12611094 ns | 15538427 ns | 0.81 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 405232 ns | 465885 ns | 0.87 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 667 ns | 625 ns | 1.07 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 708 ns | 0.88 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 875 ns | 792 ns | 1.10 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 709 ns | 667 ns | 1.06 |
| bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 19101 ns | 19376 ns | 0.99 |
| bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1145778 ns | 1181689 ns | 0.97 |
| bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 26409 ns | 30801 ns | 0.86 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1458 ns | 1375 ns | 1.06 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1334 ns | 1417 ns | 0.94 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1583 ns | 1500 ns | 1.06 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1375 ns | 1375 ns | 1 |
| bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 117126.5 ns | 115818 ns | 1.01 |
| bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 8850213 ns | 8578264 ns | 1.03 |
| bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 115676 ns | 125221.5 ns | 0.92 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7291 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 6125 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6084 ns | 5458 ns | 1.11 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 10375 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23587 ns | 24404 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1261233 ns | 1185331.5 ns | 1.06 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 52723 ns | 47150 ns | 1.12 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 229167 ns | 259875 ns | 0.88 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230667 ns | 239750 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 267875 ns | 238375 ns | 1.12 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 257458 ns | 212937.5 ns | 1.21 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 182744 ns | 194467.5 ns | 0.94 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 32590762.5 ns | 30488731 ns | 1.07 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 548449.5 ns | 603521 ns | 0.91 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3958 ns | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 4084 ns | 0.97 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 4125 ns | 0.96 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 4084 ns | 0.96 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 22860 ns | 23361 ns | 0.98 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 1933593 ns | 1914869 ns | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 39504 ns | 47581 ns | 0.83 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 17042 ns | 16958 ns | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16875 ns | 16875 ns | 1 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17083 ns | 16667 ns | 1.02 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16875 ns | 16750 ns | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 185787.5 ns | 186194.5 ns | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 10029430 ns | 9861733 ns | 1.02 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 162052 ns | 172361.5 ns | 0.94 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 491583 ns | 490917 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 385625 ns | 385541 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 386458 ns | 313292 ns | 1.23 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 844083 ns | 846958.5 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113763 ns | 113486.5 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 418213 ns | 398692.5 ns | 1.05 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 388657 ns | 245177.5 ns | 1.59 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2155583 ns | 2139937 ns | 1.01 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1863374.5 ns | 1863583 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1865167 ns | 1584583.5 ns | 1.18 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3377520.5 ns | 3114083 ns | 1.08 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 229580 ns | 229713.5 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 9922983 ns | 11955773.5 ns | 0.83 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 610962 ns | 745073 ns | 0.82 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6500 ns | 7104 ns | 0.91 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5500 ns | 6792 ns | 0.81 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7667 ns | 7083 ns | 1.08 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5167 ns | 6229.5 ns | 0.83 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 84720.5 ns | 83179 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5300415 ns | 6726845 ns | 0.79 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 59932 ns | 59261 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11229 ns | 10250 ns | 1.10 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11395.5 ns | 11458 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12334 ns | 11895.5 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10667 ns | 11166.5 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 602168 ns | 592614 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 38613143.5 ns | 37936205.5 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 383917 ns | 410389 ns | 0.94 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 542 ns | 0.92 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 583 ns | 0.93 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23328 ns | 23257 ns | 1.00 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2178076 ns | 2214765 ns | 0.98 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 41367 ns | 48421 ns | 0.85 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2084 ns | 2167 ns | 0.96 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2166 ns | 2125 ns | 1.02 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2208 ns | 0.98 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2084 ns | 2125 ns | 0.98 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 228927.5 ns | 230148 ns | 0.99 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 11774524 ns | 11848732 ns | 0.99 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 165900 ns | 178962 ns | 0.93 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9584 ns | 8917 ns | 1.07 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8333 ns | 8917 ns | 0.93 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9895.5 ns | 10083 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8542 ns | 9208 ns | 0.93 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 105241 ns | 99883.5 ns | 1.05 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3103348.5 ns | 3281834 ns | 0.95 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 71955 ns | 73811 ns | 0.97 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17688 ns | 17438 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16666.5 ns | 17125 ns | 0.97 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18708 ns | 19125 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17562 ns | 17375 ns | 1.01 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 595171 ns | 574862.5 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 16252508 ns | 17368412 ns | 0.94 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 358129 ns | 382279 ns | 0.94 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 459 ns | 1.18 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 458 ns | 583 ns | 0.79 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 625 ns | 0.93 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 458 ns | 500 ns | 0.92 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 34578 ns | 34631 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1237584 ns | 1211808 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 41387 ns | 48701 ns | 0.85 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9229 ns | 9146 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8958.5 ns | 9521 ns | 0.94 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9750 ns | 10229.5 ns | 0.95 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8104 ns | 8604 ns | 0.94 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 257823 ns | 260130.5 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18331589 ns | 19439989.5 ns | 0.94 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 349944 ns | 367164 ns | 0.95 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397270.5 ns | 396854.5 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288083 ns | 288229.5 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288666.5 ns | 215042 ns | 1.34 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 751792 ns | 755958 ns | 0.99 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112022 ns | 112250 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 349915 ns | 328996 ns | 1.06 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 74609 ns | 75451 ns | 0.99 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1454270.5 ns | 1462500 ns | 0.99 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1130500 ns | 1136041 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1131583 ns | 860334 ns | 1.32 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2437959 ns | 2439875 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 200057 ns | 199853.5 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 7687949 ns | 9985334 ns | 0.77 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 302285 ns | 324698 ns | 0.93 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7750 ns | 7000 ns | 1.11 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7083.5 ns | 7687.5 ns | 0.92 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8312.5 ns | 8375 ns | 0.99 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6687.5 ns | 7499.5 ns | 0.89 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 139766 ns | 138856 ns | 1.01 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5685169 ns | 6055720 ns | 0.94 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 60383 ns | 60111 ns | 1.00 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13479.5 ns | 15874.5 ns | 0.85 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 12750 ns | 16271 ns | 0.78 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15125 ns | 15792 ns | 0.96 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14625.5 ns | 13125.5 ns | 1.11 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 923489 ns | 911828 ns | 1.01 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 42519536.5 ns | 42608795.5 ns | 1.00 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 407432 ns | 429664 ns | 0.95 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25625 ns | 24000 ns | 1.07 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23666 ns | 25958 ns | 0.91 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 29417 ns | 26833.5 ns | 1.10 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24041 ns | 24937.5 ns | 0.96 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 186240.5 ns | 189463 ns | 0.98 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7554376 ns | 7536335 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 120505 ns | 112782 ns | 1.07 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 152187 ns | 146084 ns | 1.04 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 145250 ns | 152541.5 ns | 0.95 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 146917 ns | 105833 ns | 1.39 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 103958 ns | 153500 ns | 0.68 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1013659 ns | 1027043 ns | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44493070 ns | 41813684 ns | 1.06 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 535240 ns | 587426 ns | 0.91 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74583 ns | 74042 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 79584 ns | 84500 ns | 0.94 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76791.5 ns | 74917 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76083 ns | 74333 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 190594.5 ns | 195104 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7364811 ns | 7388961 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 121316.5 ns | 121551 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 273562.5 ns | 281250 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 304084 ns | 290833 ns | 1.05 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 303333 ns | 244667 ns | 1.24 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 307583 ns | 297125 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1045024 ns | 1044893 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 39473308 ns | 40287331 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 624192 ns | 693978 ns | 0.90 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12417 ns | 12583.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12896 ns | 13333.5 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14000 ns | 14000 ns | 1 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12500 ns | 13125 ns | 0.95 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 138416 ns | 137568.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5479910 ns | 5655781 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 226152 ns | 235892 ns | 0.96 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27792 ns | 27458 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26458 ns | 28437 ns | 0.93 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 28437.5 ns | 27583 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 33937.5 ns | 25396 ns | 1.34 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 924126.5 ns | 925629.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 42086872 ns | 42183215 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 610976 ns | 696807 ns | 0.88 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11124.5 ns | 10583.5 ns | 1.05 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 10333 ns | 11729 ns | 0.88 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12479.5 ns | 14020.5 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11125 ns | 11166 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 118543.5 ns | 119207 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3443799.5 ns | 3459447 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 233176 ns | 241797.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 22291.5 ns | 22228.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 22417 ns | 22979 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 24167 ns | 24041 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 28562.5 ns | 22958 ns | 1.24 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 668341 ns | 679984 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21034051 ns | 21093495.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 569113 ns | 675492 ns | 0.84 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 68709 ns | 65145.5 ns | 1.05 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 62750 ns | 69062 ns | 0.91 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 67520.5 ns | 67375 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 64417 ns | 63250 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 102389 ns | 102654.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3441143 ns | 3365331 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 230751 ns | 244962 ns | 0.94 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 506375 ns | 512250 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 510167 ns | 511875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 475209 ns | 467958.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 647896 ns | 464791 ns | 1.39 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 492781 ns | 497974 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20664230 ns | 19959026 ns | 1.04 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 593680 ns | 716037 ns | 0.83 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7958 ns | 7458 ns | 1.07 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6750 ns | 7479.5 ns | 0.90 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8208 ns | 8791.5 ns | 0.93 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7562.5 ns | 7000 ns | 1.08 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 137965 ns | 136611.5 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5508177.5 ns | 5668588 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 62687 ns | 59181 ns | 1.06 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16125 ns | 16084 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16250 ns | 16104 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16250 ns | 15145.5 ns | 1.07 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14833 ns | 15292 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 900927 ns | 892529 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 39349971 ns | 37494483 ns | 1.05 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 388286 ns | 399300 ns | 0.97 |
| batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6150354 ns | 6148250 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6368167 ns | 6373958.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6373937.5 ns | 3229667 ns | 1.97 |
| batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11915167 ns | 11910625 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/GPU/CUDA | 345749 ns | 348836 ns | 0.99 |
| batchedmm(512, Bsize=4)/forward/GPU/oneAPI | 49052559 ns | 48313142 ns | 1.02 |
| batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 388426 ns | 303493 ns | 1.28 |
| batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19083437.5 ns | 19111312.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 19960479.5 ns | 19956500 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19966834 ns | 11118833 ns | 1.80 |
| batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 37142104 ns | 36495125 ns | 1.02 |
| batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1072087 ns | 1010983 ns | 1.06 |
| batchedmm(512, Bsize=4)/zygote/GPU/oneAPI | 78467188 ns | 77165819 ns | 1.02 |
| batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1035750.5 ns | 1185177 ns | 0.87 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 958 ns | 1000 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1000 ns | 1042 ns | 0.96 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1042 ns | 1042 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 958 ns | 959 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23415 ns | 23306 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2079171 ns | 2102392 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 200906 ns | 210582 ns | 0.95 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3917 ns | 3958 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4000 ns | 3959 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4041 ns | 4041 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5458 ns | 3958 ns | 1.38 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 270573.5 ns | 274898 ns | 0.98 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10484095 ns | 10835037 ns | 0.97 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 486775 ns | 633051.5 ns | 0.77 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8687 ns | 7667 ns | 1.13 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7459 ns | 8271 ns | 0.90 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9334 ns | 10167 ns | 0.92 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7834 ns | 7999.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 116220 ns | 116562 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3435001.5 ns | 3281805 ns | 1.05 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 71133 ns | 68781 ns | 1.03 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 12125 ns | 11833 ns | 1.02 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11958 ns | 12271 ns | 0.97 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 13000 ns | 13500 ns | 0.96 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11750 ns | 12209 ns | 0.96 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 609643.5 ns | 610392 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 21784602 ns | 20835527 ns | 1.05 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 341729 ns | 356904 ns | 0.96 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 291 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22413 ns | 22489 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2035110 ns | 2031329 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 44053 ns | 49170 ns | 0.90 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3000 ns | 3042 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2917 ns | 2958 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3208 ns | 3209 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2916 ns | 2834 ns | 1.03 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 194923.5 ns | 196092.5 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9225861.5 ns | 9721843.5 ns | 0.95 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 154488.5 ns | 166151.5 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11625 ns | 11542 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10500 ns | 12584 ns | 0.83 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12875 ns | 13333.5 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11875 ns | 10959 ns | 1.08 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 115370 ns | 115616.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3433218 ns | 3435294.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 231793 ns | 238972 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 22667 ns | 22250 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 22104.5 ns | 22583.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23625 ns | 22875 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 26729 ns | 23104.5 ns | 1.16 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 555861 ns | 561561 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20482208 ns | 19972002 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 545740 ns | 652206 ns | 0.84 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4334 ns | 4167 ns | 1.04 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4333 ns | 4375 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4208 ns | 4458 ns | 0.94 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4250 ns | 4375 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 23923 ns | 24400 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2205811 ns | 2160362 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 44864 ns | 49090 ns | 0.91 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16500 ns | 16167 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16333 ns | 16625 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16166 ns | 16291 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16292 ns | 16584 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 319806 ns | 320232 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 10190777 ns | 12103289.5 ns | 0.84 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 186077 ns | 205902 ns | 0.90 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2125 ns | 2084 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2084 ns | 2209 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2209 ns | 2167 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2000 ns | 2125 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 35327 ns | 35395 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1213779 ns | 1121264.5 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 199242 ns | 218222 ns | 0.91 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 17104 ns | 17896 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20167 ns | 17916 ns | 1.13 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 19000 ns | 19125 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23083.5 ns | 18146 ns | 1.27 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 284984 ns | 286121 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18211018 ns | 20551833.5 ns | 0.89 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 583431 ns | 685457 ns | 0.85 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59458 ns | 60208.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 65666 ns | 65458 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 66125 ns | 60938 ns | 1.09 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 52833 ns | 53875 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66304 ns | 66633 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/oneAPI | 87707222.5 ns | 86298273 ns | 1.02 |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 110241 ns | 102431 ns | 1.08 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 153041 ns | 197791.5 ns | 0.77 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 155229 ns | 162042 ns | 0.96 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 130209 ns | 137250 ns | 0.95 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 286334 ns | 295208 ns | 0.97 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 210129.5 ns | 211289 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/GPU/oneAPI | 149924497 ns | 152039178 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 511145 ns | 510905 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 106521 ns | 123834 ns | 0.86 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 78958 ns | 123125 ns | 0.64 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84042 ns | 84312.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 115521 ns | 90875 ns | 1.27 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 191513.5 ns | 193182.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5334020 ns | 5322780 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 267630 ns | 192502 ns | 1.39 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1894896 ns | 1921875 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1902375 ns | 1909416 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1878334 ns | 1888250 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1895250 ns | 1881750 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 507442 ns | 510619 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 28152566.5 ns | 26283882 ns | 1.07 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 825763 ns | 911709 ns | 0.91 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21516 ns | 21603 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2100524 ns | 2089663 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 35507 ns | 42021 ns | 0.84 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1833 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1875 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1833 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 245735 ns | 246530.5 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 9780504 ns | 9718939 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 164548 ns | 183711 ns | 0.90 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10916 ns | 8083 ns | 1.35 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8291 ns | 9791 ns | 0.85 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11146 ns | 12125 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9500 ns | 8458 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 114788 ns | 115667.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3351587 ns | 3479265.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 232004 ns | 238712 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8916 ns | 10334 ns | 0.86 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8854.5 ns | 10375 ns | 0.85 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10917 ns | 10709 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9583 ns | 10750 ns | 0.89 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 491693 ns | 493762.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19969043 ns | 19419012 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 536332 ns | 631376 ns | 0.85 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57958 ns | 58459 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46625 ns | 46541 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46750 ns | 39791 ns | 1.17 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83166 ns | 82958 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38476.5 ns | 39195 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1460287 ns | 1326636 ns | 1.10 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 71814 ns | 77861 ns | 0.92 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1905145.5 ns | 1927333.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1949542 ns | 1977312 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1958500 ns | 1955167 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1874958 ns | 1892417 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 212675 ns | 217765.5 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33332615 ns | 33483865 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 968925.5 ns | 1004015.5 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 267500 ns | 267875 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 271479.5 ns | 277417 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 271209 ns | 270958 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 268209 ns | 278250 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 194219.5 ns | 198525 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7638787 ns | 7684906 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 271267 ns | 283563 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 585333.5 ns | 614937.5 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 600292 ns | 658104 ns | 0.91 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 671042 ns | 590146 ns | 1.14 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 845604.5 ns | 646750.5 ns | 1.31 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 991966 ns | 1004951 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42952243 ns | 44721716 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 831153 ns | 899859 ns | 0.92 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2211666 ns | 2206250 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2203958 ns | 2176625 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2229083 ns | 2107416 ns | 1.06 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2173792 ns | 2210708 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 161646 ns | 158799 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8668502.5 ns | 8305150 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 470965 ns | 412934 ns | 1.14 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5493104.5 ns | 5495166.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5515875 ns | 5498084 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5526542 ns | 5497292 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6852458 ns | 5479145.5 ns | 1.25 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 959137 ns | 942447 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 49532486 ns | 52379643 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1437405 ns | 1717957 ns | 0.84 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 478292 ns | 476375 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 345625 ns | 344833 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 346750 ns | 255667 ns | 1.36 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 908542 ns | 909083 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46909 ns | 46257.5 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 871386 ns | 876632 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 393175 ns | 245143 ns | 1.60 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2137500 ns | 2148125 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1869334 ns | 1855417 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1859271 ns | 1588042 ns | 1.17 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3380209 ns | 3122292 ns | 1.08 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 264095.5 ns | 253305 ns | 1.04 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 13390420 ns | 13286897 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 632907.5 ns | 772413 ns | 0.82 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57458 ns | 57958.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46166 ns | 45791.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46250 ns | 39417 ns | 1.17 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 78667 ns | 82625 ns | 0.95 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28560 ns | 28551 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1394875.5 ns | 1363872 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 73147 ns | 74231 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2029292 ns | 2040292 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2078187.5 ns | 2064375 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2063250 ns | 2084167 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1963958 ns | 1983271 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 230846.5 ns | 225739 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 36347331 ns | 35716396.5 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 980522 ns | 1031871 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58083.5 ns | 58333 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46584 ns | 46834 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46917 ns | 39667 ns | 1.18 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 79958 ns | 83000 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48944 ns | 48471 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 829446 ns | 789293.5 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 71428.5 ns | 71026 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1871729 ns | 1926917 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1973604 ns | 1963709 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1944167 ns | 1974354 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1876792 ns | 1891625 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 238010 ns | 232200 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 18705710.5 ns | 17717639 ns | 1.06 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 881607.5 ns | 916564 ns | 0.96 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 34878 ns | 33909 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1190778.5 ns | 1226571 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 47028 ns | 45910 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6270.5 ns | 5916 ns | 1.06 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6187.5 ns | 7187.5 ns | 0.86 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7375 ns | 7459 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6125 ns | 6333 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 211705.5 ns | 201066 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20119098 ns | 20694042 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 332741 ns | 365424 ns | 0.91 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32902 ns | 32008 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1224139 ns | 1150720 ns | 1.06 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 36327 ns | 37940 ns | 0.96 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2667 ns | 2709 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2667 ns | 3041 ns | 0.88 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 4292 ns | 3708 ns | 1.16 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 3167 ns | 3500 ns | 0.90 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 187662.5 ns | 181870 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 5673429 ns | 7654347.5 ns | 0.74 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 136635 ns | 149631 ns | 0.91 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 467208 ns | 491875 ns | 0.95 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 469417 ns | 465938 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 466875 ns | 469979 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 464979.5 ns | 495375 ns | 0.94 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 137312 ns | 134587.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5812904.5 ns | 6261994 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 361475 ns | 348083 ns | 1.04 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4027749.5 ns | 4056250 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4071500 ns | 4071312.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4067417 ns | 4083458.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5516750 ns | 4067500 ns | 1.36 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 690445 ns | 675142 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32063716 ns | 34719295 ns | 0.92 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1091915 ns | 1296728 ns | 0.84 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49879250 ns | 49815354 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35487583 ns | 35531875 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35512833.5 ns | 25976083 ns | 1.37 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96974083 ns | 96976979 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1622377 ns | 1620332 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/oneAPI | 55868634.5 ns | 55439103 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1579230 ns | 1059456 ns | 1.49 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154423062.5 ns | 154432166.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112364750 ns | 112364500.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112377416 ns | 88728958 ns | 1.27 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 299989812 ns | 298587354.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6468945 ns | 6497993.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/oneAPI | 126761495 ns | 126106582 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 7230228 ns | 5589506 ns | 1.29 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 19104.5 ns | 18292 ns | 1.04 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 18375 ns | 17542 ns | 1.05 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 17375.5 ns | 13625 ns | 1.28 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15083 ns | 16583.5 ns | 0.91 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 19621 ns | 19675 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1223248 ns | 1142269.5 ns | 1.07 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 28854 ns | 27480 ns | 1.05 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 11062.5 ns | 11000 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 8833 ns | 9020.5 ns | 0.98 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 9291 ns | 7792 ns | 1.19 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17667 ns | 17375 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 252067.5 ns | 242665 ns | 1.04 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 9844493 ns | 10148653 ns | 0.97 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 138484 ns | 144671.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7937.5 ns | 7958.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8125 ns | 9125 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10375 ns | 10375 ns | 1 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8708 ns | 7833.5 ns | 1.11 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 120230.5 ns | 117743.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3557828.5 ns | 3571636.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 235119 ns | 238312 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9708 ns | 9083 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9084 ns | 10188 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9792 ns | 11500 ns | 0.85 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10667 ns | 9500 ns | 1.12 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 599437 ns | 580494.5 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22720103 ns | 24076504 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 557070 ns | 649931.5 ns | 0.86 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9291.5 ns | 9416 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 8812.5 ns | 9709 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9917 ns | 10458 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 8958.5 ns | 9396 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 118821 ns | 114984 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3465548.5 ns | 3341616 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 71593 ns | 71321 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13687.5 ns | 13916.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13604.5 ns | 13541.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 14395.5 ns | 17208.5 ns | 0.84 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 14750 ns | 13187.5 ns | 1.12 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 570663 ns | 552056 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20121784.5 ns | 20781499.5 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 323504 ns | 344233 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 35088 ns | 33628 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1218149.5 ns | 1186325 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 203871 ns | 207932 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7562.5 ns | 7437 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7667 ns | 8584 ns | 0.89 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7875 ns | 9666 ns | 0.81 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8520.5 ns | 7354.5 ns | 1.16 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 227876 ns | 221757 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 22566032 ns | 22841477 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 569945 ns | 657467 ns | 0.87 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 16458 ns | 16583 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 17041 ns | 16958 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 16209 ns | 12354 ns | 1.31 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10979 ns | 11625 ns | 0.94 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 20941 ns | 19779 ns | 1.06 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1150830 ns | 1178666.5 ns | 0.98 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 182992 ns | 191642 ns | 0.95 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 35666 ns | 35375 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 35167 ns | 35479 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 36000 ns | 35479.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 57833 ns | 35584 ns | 1.63 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 265749 ns | 258411 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 12188303 ns | 11074698.5 ns | 1.10 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 534293 ns | 591756 ns | 0.90 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 447500 ns | 449333 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 488042 ns | 450125 ns | 1.08 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 455709 ns | 463875 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 496916 ns | 486917 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195513 ns | 194667 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5997948.5 ns | 5885088 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 328714 ns | 347133 ns | 0.95 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4024209 ns | 4054500 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4055021 ns | 4060604.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4053917 ns | 4063834 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5501562.5 ns | 4052291.5 ns | 1.36 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 521631.5 ns | 510233 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 27256015 ns | 28172431.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1059038 ns | 1353408.5 ns | 0.78 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 836727208 ns | 780318375 ns | 1.07 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 553913292 ns | 543371375 ns | 1.02 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 540736625 ns | 415007687 ns | 1.30 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1517196875 ns | 1572225062.5 ns | 0.96 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22767789 ns | 22558969 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/GPU/oneAPI | 174930068 ns | 174041531 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 10331681 ns | 14555295 ns | 0.71 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 3773348667 ns | 2500858833 ns | 1.51 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1782084291 ns | 1786181583 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1780399750 ns | 1510021583 ns | 1.18 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4786718666 ns | 6317458166 ns | 0.76 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118657187 ns | 119503116 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/GPU/oneAPI | 1332561794 ns | 931368955.5 ns | 1.43 |
| batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 67063298 ns | 87832876 ns | 0.76 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 76542 ns | 76375 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76584 ns | 77083 ns | 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 79583 ns | 83334 ns | 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76708.5 ns | 75354 ns | 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 195943.5 ns | 194473.5 ns | 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 5455658.5 ns | 8155928 ns | 0.67
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 123300.5 ns | 106291 ns | 1.16
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 191292 ns | 277375 ns | 0.69
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 252042 ns | 193666.5 ns | 1.30
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 199562.5 ns | 291542 ns | 0.68
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 225542 ns | 203875 ns | 1.11
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1004442 ns | 999103 ns | 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43458500 ns | 42482446 ns | 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 590764 ns | 628231.5 ns | 0.94
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199694520.5 ns | 199366166.5 ns | 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 138856500 ns | 139444084 ns | 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 139241166 ns | 103950000 ns | 1.34
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 393790959 ns | 388306958 ns | 1.01
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5842492 ns | 5837076.5 ns | 1.00
batchedmm(512, Bsize=128)/forward/GPU/oneAPI | 78913006.5 ns | 78178829 ns | 1.01
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 4746717.5 ns | 3620336 ns | 1.31
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 617676375.5 ns | 617703104.5 ns | 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 439446917 ns | 438890042 ns | 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 439765166.5 ns | 352507250 ns | 1.25
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1174222000 ns | 1183186458 ns | 0.99
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26723523 ns | 26786910.5 ns | 1.00
batchedmm(512, Bsize=128)/zygote/GPU/oneAPI | 276392509 ns | 274964991 ns | 1.01
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 15854720 ns | 21952578.5 ns | 0.72
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7250 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6125 ns | 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5959 ns | 5417 ns | 1.10
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9834 ns | 9917 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26896.5 ns | 26517 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1173091 ns | 1160586 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 55173 ns | 46431 ns | 1.19
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213041.5 ns | 224854 ns | 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227729 ns | 230541 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220416.5 ns | 229812.5 ns | 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206125 ns | 207958 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 219868 ns | 215879.5 ns | 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20153337 ns | 20490896 ns | 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 541982 ns | 528825 ns | 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8521 ns | 6458 ns | 1.32
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7458 ns | 9000 ns | 0.83
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11167 ns | 9750 ns | 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9250 ns | 8770.5 ns | 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 115361 ns | 109989.5 ns | 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3392154.5 ns | 3318372 ns | 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 74069 ns | 72691 ns | 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7562.5 ns | 7666.5 ns | 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7958 ns | 8417 ns | 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8167 ns | 11750 ns | 0.70
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7395.5 ns | 7562.5 ns | 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 495697 ns | 485874.5 ns | 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20965461 ns | 19877956 ns | 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 309298 ns | 315043 ns | 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 417 ns | 458 ns | 0.91
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 459 ns | 750 ns | 0.61
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 500 ns | 750 ns | 0.67
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 375 ns | 459 ns | 0.82
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 26124 ns | 25151 ns | 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 1243719 ns | 1214235 ns | 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 45334 ns | 48561 ns | 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9584 ns | 8833 ns | 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9062.5 ns | 9542 ns | 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9792 ns | 11834 ns | 0.83
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9542 ns | 8250 ns | 1.16
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 247606 ns | 245667 ns | 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24899790.5 ns | 23350383.5 ns | 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 382304 ns | 388103 ns | 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 112312.5 ns | 111708 ns | 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 103229 ns | 101708 ns | 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 104104.5 ns | 87542 ns | 1.19
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 155083 ns | 154542 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 23501 ns | 22556 ns | 1.04
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI | 811475 ns | 822944.5 ns | 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 192539 ns | 200302 ns | 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 536562 ns | 576604.5 ns | 0.93
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 554250 ns | 577208 ns | 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 535291.5 ns | 579583 ns | 0.92
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 910854 ns | 535334 ns | 1.70
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 221242 ns | 215893 ns | 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11751092 ns | 11598893 ns | 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 560216.5 ns | 606916 ns | 0.92
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5416.5 ns | 5500 ns | 0.98
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 6208.5 ns | 6187.5 ns | 1.00
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 6021 ns | 7583 ns | 0.79
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 4000 ns | 5646 ns | 0.71
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17520 ns | 16999 ns | 1.03
batchedmm(16, Bsize=32)/forward/GPU/oneAPI | 72849606 ns | 71875004 ns | 1.01
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 73648 ns | 71250 ns | 1.03
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 11562.5 ns | 12166.5 ns | 0.95
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 11062 ns | 10833.5 ns | 1.02
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 11000 ns | 11104 ns | 0.99
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 16666 ns | 16667 ns | 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 207455.5 ns | 203355.5 ns | 1.02
batchedmm(16, Bsize=32)/zygote/GPU/oneAPI | 97442684 ns | 97881235 ns | 1.00
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 330387 ns | 362713 ns | 0.91
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 39667 ns | 40375 ns | 0.98
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 51291 ns | 51334 ns | 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 52958.5 ns | 51083 ns | 1.04
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13625 ns | 13625 ns | 1
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 20356 ns | 21217 ns | 0.96
batchedmm(16, Bsize=128)/forward/GPU/oneAPI | 76663129 ns | 78292175 ns | 0.98
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 98364 ns | 81245.5 ns | 1.21
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 36375.5 ns | 37437.5 ns | 0.97
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 31417 ns | 31833.5 ns | 0.99
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 31229.5 ns | 30145.5 ns | 1.04
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 57000 ns | 57333 ns | 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 184178 ns | 180954 ns | 1.02
batchedmm(16, Bsize=128)/zygote/GPU/oneAPI | 111708023 ns | 111821475 ns | 1.00
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 355254 ns | 393694 ns | 0.90
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1750 ns | 1667 ns | 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 2042 ns | 1834 ns | 1.11
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2208 ns | 2583 ns | 0.85
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1875 ns | 1583 ns | 1.18
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 19575 ns | 19103 ns | 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI | 1219758.5 ns | 1181507 ns | 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 29099.5 ns | 29580 ns | 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2208 ns | 2291 ns | 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2167 ns | 2167 ns | 1
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2375 ns | 2541 ns | 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2208 ns | 2125 ns | 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 198996.5 ns | 192587.5 ns | 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI | 8766738.5 ns | 9137253 ns | 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 128571 ns | 137661 ns | 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4583 ns | 5041 ns | 0.91
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4417 ns | 4792 ns | 0.92
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6729 ns | 6708 ns | 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3958 ns | 5292 ns | 0.75
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 143699.5 ns | 139532 ns | 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 5704411.5 ns | 5873388 ns | 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 61955.5 ns | 61421 ns | 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8334 ns | 8250 ns | 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8083.5 ns | 8583 ns | 0.94
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8709 ns | 9333 ns | 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8583 ns | 8062.5 ns | 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 836045.5 ns | 812160 ns | 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 39725172 ns | 39105619 ns | 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 364891 ns | 390114 ns | 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 54833 ns | 55000 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 55833 ns | 55875 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 55583 ns | 54333 ns | 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 56000 ns | 56208 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36570 ns | 36258 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1345223 ns | 1233762 ns | 1.09
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 202568 ns | 217247.5 ns | 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 476729 ns | 523187.5 ns | 0.91
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 494500 ns | 495646 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 494208 ns | 509125 ns | 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 641625 ns | 508354 ns | 1.26
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 259886 ns | 258312 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 28017517.5 ns | 27334844 ns | 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 705894 ns | 802628 ns | 0.88
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3310333 ns | 3307500 ns | 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 2334062.5 ns | 2332208.5 ns | 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2333375 ns | 1767750 ns | 1.32
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6300479 ns | 6289687.5 ns | 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 204581.5 ns | 205336 ns | 1.00
batchedmm(128, Bsize=128)/forward/GPU/oneAPI | 77398976 ns | 78138642 ns | 0.99
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 373097 ns | 213372 ns | 1.75
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11459729 ns | 11443687 ns | 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 8305729.5 ns | 8355854.5 ns | 0.99
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8342854 ns | 6598583.5 ns | 1.26
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21088292 ns | 21066479 ns | 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 744676 ns | 735491 ns | 1.01
batchedmm(128, Bsize=128)/zygote/GPU/oneAPI | 121497637 ns | 121355919 ns | 1.00
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1994797.5 ns | 1063901 ns | 1.87
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4833 ns | 7208 ns | 0.67
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4646 ns | 6604 ns | 0.70
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7520.5 ns | 7708 ns | 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4917 ns | 4708 ns | 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 133339 ns | 130238.5 ns | 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5450569.5 ns | 5600093 ns | 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 61520 ns | 55701 ns | 1.10
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7083 ns | 7604 ns | 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7291.5 ns | 7562.5 ns | 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7500 ns | 8083 ns | 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7416.5 ns | 7292 ns | 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 725863 ns | 714522 ns | 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 33872141 ns | 35658157 ns | 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 353680 ns | 368784 ns | 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 100459 ns | 98292 ns | 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 123042 ns | 103667 ns | 1.19
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 102417 ns | 127291 ns | 0.80
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 121458.5 ns | 122417 ns | 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 151940.5 ns | 149309 ns | 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5695179 ns | 5831672 ns | 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 233346 ns | 183632 ns | 1.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2033271 ns | 2028041 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2026417 ns | 2022292 ns | 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1997458.5 ns | 2031625 ns | 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2041833 ns | 2019021 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 678763 ns | 669751 ns | 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31810809 ns | 34116389.5 ns | 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 931831 ns | 1113696 ns | 0.84
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 32666 ns | 32999.5 ns | 0.99
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 36562.5 ns | 36208 ns | 1.01
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 36167 ns | 33125 ns | 1.09
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 667 ns | 542 ns | 1.23
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15627 ns | 15437 ns | 1.01
batchedmm(2, Bsize=4)/forward/GPU/oneAPI | 72187220 ns | 72358742 ns | 1.00
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 70121 ns | 84900 ns | 0.83
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2604.5 ns | 2667 ns | 0.98
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2958 ns | 3000 ns | 0.99
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2937.5 ns | 3208 ns | 0.92
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2167 ns | 2250 ns | 0.96
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 139744 ns | 136315 ns | 1.03
batchedmm(2, Bsize=4)/zygote/GPU/oneAPI | 92749943 ns | 92893398 ns | 1.00
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 289641 ns | 350423 ns | 0.83
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7208 ns | 1
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 6083 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5916 ns | 5416 ns | 1.09
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9917 ns | 10167 ns | 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35855 ns | 35436 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1252207 ns | 1228537 ns | 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 53911 ns | 49691 ns | 1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212958.5 ns | 232749.5 ns | 0.91
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222708 ns | 221125 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219917 ns | 227541.5 ns | 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206209 ns | 205750 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 243430 ns | 240533 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27468024.5 ns | 26122810 ns | 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 513269 ns | 509435 ns | 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3750 ns | 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3917 ns | 0.96
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3958 ns | 0.95
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3791 ns | 3917 ns | 0.97
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21959 ns | 21412 ns | 1.03
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2194149 ns | 2114597 ns | 1.04
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 35557 ns | 42980 ns | 0.83
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14500 ns | 14542 ns | 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14500 ns | 14917 ns | 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14500 ns | 14792 ns | 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14459 ns | 14917 ns | 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 302419 ns | 297410.5 ns | 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 11036089 ns | 10838818 ns | 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 179841 ns | 196172 ns | 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 128041 ns | 97937 ns | 1.31
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144417 ns | 102750 ns | 1.41
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 106917 ns | 130333 ns | 0.82
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 151959 ns | 127709 ns | 1.19
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 140874 ns | 132466 ns | 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5963081 ns | 5909094 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 236762 ns | 182122 ns | 1.30
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924583 ns | 1924333 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1920500 ns | 1920667 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1914229.5 ns | 1921792 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1928875 ns | 1912771 ns | 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 673452 ns | 659652 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 29935915 ns | 31062786 ns | 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 899671 ns | 1217372 ns | 0.74
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17333 ns | 17625 ns | 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17354.5 ns | 18666.5 ns | 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21208 ns | 21834 ns | 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17375 ns | 17125 ns | 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 108833.5 ns | 103789.5 ns | 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3415955 ns | 3441121 ns | 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 91100 ns | 75841 ns | 1.20
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 216917 ns | 229375 ns | 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 252646 ns | 217917 ns | 1.16
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222166 ns | 226458.5 ns | 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 229125 ns | 215521 ns | 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 508535.5 ns | 496186 ns | 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 19323488.5 ns | 18765642 ns | 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 419764 ns | 473665 ns | 0.89
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 24271 ns | 24313 ns | 1.00
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 30791.5 ns | 29875 ns | 1.03
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 29437.5 ns | 27375 ns | 1.08
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1584 ns | 1250 ns | 1.27
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16398 ns | 15897 ns | 1.03
batchedmm(16, Bsize=4)/forward/GPU/oneAPI | 72518390 ns | 71655631.5 ns | 1.01
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 76093 ns | 87071 ns | 0.87
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4500 ns | 5375.5 ns | 0.84
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4916 ns | 5083.5 ns | 0.97
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5125 ns | 5459 ns | 0.94
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4625 ns | 4834 ns | 0.96
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 204364 ns | 200684.5 ns | 1.02
batchedmm(16, Bsize=4)/zygote/GPU/oneAPI | 94073985 ns | 92849344 ns | 1.01
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 331675 ns | 389014 ns | 0.85
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222666 ns | 222083 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 220666.5 ns | 223166 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 225667 ns | 224916.5 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 220583 ns | 227000 ns | 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 222506.5 ns | 219523 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7881934.5 ns | 7712821.5 ns | 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 267871 ns | 274002.5 ns | 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 495084 ns | 495292 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 511812.5 ns | 549771 ns | 0.93
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 500854 ns | 507520.5 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 675750 ns | 497583 ns | 1.36
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1053634 ns | 1034369 ns | 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42862742 ns | 42519004 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 780999 ns | 850318.5 ns | 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20375 ns | 19708 ns | 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20000 ns | 21375 ns | 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23875 ns | 22292 ns | 1.07
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18792 ns | 24792 ns | 0.76
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 114286 ns | 111603.5 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3510843 ns | 3581394.5 ns | 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 89858 ns | 77006 ns | 1.17
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212375 ns | 218812 ns | 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213041 ns | 213041.5 ns | 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214458 ns | 221958.5 ns | 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212541 ns | 250667 ns | 0.85
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 727333.5 ns | 710892 ns | 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 24570511 ns | 24867084.5 ns | 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 469036 ns | 532655 ns | 0.88
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6666 ns | 5959 ns | 1.12
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6604.5 ns | 6917 ns | 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8750.5 ns | 8708 ns | 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6208 ns | 5917 ns | 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 137142 ns | 131648 ns | 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5605207 ns | 5786966.5 ns | 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 60974 ns | 65661 ns | 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9791 ns | 10584 ns | 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10084 ns | 10729.5 ns | 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10750 ns | 11541 ns | 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10750 ns | 10541 ns | 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 794651.5 ns | 772200 ns | 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 37034174 ns | 37330612 ns | 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 370101.5 ns | 385494 ns | 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4666 ns | 4833 ns | 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4708 ns | 6354.5 ns | 0.74
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7437.5 ns | 6604.5 ns | 1.13
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4917 ns | 5041 ns | 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 138544.5 ns | 133064 ns | 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5520602 ns | 5822443 ns | 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 59692 ns | 57140 ns | 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7458 ns | 7209 ns | 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7166 ns | 7666 ns | 0.93
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7791 ns | 8042 ns | 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7708 ns | 7500 ns | 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 755761 ns | 738153 ns | 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 37179182 ns | 40138762.5 ns | 0.93
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 376523 ns | 395034 ns | 0.95
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14498417 ns | 14423167 ns | 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10124125 ns | 10121834 ns | 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10094833 ns | 7695041.5 ns | 1.31
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27748583.5 ns | 27731208 ns | 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 532665 ns | 530060 ns | 1.00
batchedmm(128, Bsize=512)/forward/GPU/oneAPI | 94795139 ns | 94502665 ns | 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 866850 ns | 400144 ns | 2.17
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46333437 ns | 46295271.5 ns | 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33447541.5 ns | 33585729.5 ns | 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33510458 ns | 26523271 ns | 1.26
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85445667 ns | 85105834 ns | 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2636151 ns | 2636621 ns | 1.00
batchedmm(128, Bsize=512)/zygote/GPU/oneAPI | 192783631 ns | 190779173 ns | 1.01
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 5189385.5 ns | 3293333 ns | 1.58
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 66458 ns | 67125 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 65687.5 ns | 68791 ns | 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 70500 ns | 69875 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 66500 ns | 67541 ns | 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 118172.5 ns | 116341 ns | 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3662360 ns | 3481863 ns | 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 237313 ns | 238303 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 467958 ns | 467979.5 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 480333.5 ns | 468833 ns | 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 474916.5 ns | 479729 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 686583.5 ns | 467333.5 ns | 1.47
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 715446 ns | 704065 ns | 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26609747 ns | 26310960 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 655875 ns | 795648 ns | 0.82
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 542 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32877 ns | 32111 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1227269 ns | 1221683 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47579 ns | 47180 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8750 ns | 8375 ns | 1.04 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9208 ns | 9417 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9104.5 ns | 9584 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9750 ns | 8416 ns | 1.16 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 280778.5 ns | 277435.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21881943 ns | 20099617 ns | 1.09 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 355484 ns | 375813.5 ns | 0.95 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9500 ns | 9459 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9500 ns | 9625 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9500 ns | 9708 ns | 0.98 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9500 ns | 9625 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23273 ns | 22950 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1862112.5 ns | 2089156.5 ns | 0.89 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 200655 ns | 212492 ns | 0.94 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 50209 ns | 50167 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 50250 ns | 50292 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 50500 ns | 50541 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 72375 ns | 50375 ns | 1.44 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 278469.5 ns | 272026 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 13204061 ns | 11125411 ns | 1.19 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 491037 ns | 611216 ns | 0.80 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 54917 ns | 55250 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 55667 ns | 55917 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 55584 ns | 54375 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 56000 ns | 56041 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28169 ns | 27749 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1174691 ns | 1229944.5 ns | 0.96 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 203240 ns | 214587 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 518854 ns | 485479 ns | 1.07 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 500625 ns | 496084 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 497750 ns | 537000.5 ns | 0.93 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 643417 ns | 461291.5 ns | 1.39 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 238777 ns | 237315 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31628121.5 ns | 32908722.5 ns | 0.96 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 758938 ns | 839118 ns | 0.90 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 655042 ns | 651166.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 613083 ns | 645917 ns | 0.95 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 652541 ns | 662000 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 678416.5 ns | 641417 ns | 1.06 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192069 ns | 190601 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8140636 ns | 8668801 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 269704 ns | 229822 ns | 1.17 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2167104.5 ns | 2241917 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2233125 ns | 2232875 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2241292 ns | 2250458.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2230208.5 ns | 2234417 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 929752.5 ns | 914905 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 55073105 ns | 49141404 ns | 1.12 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1217770.5 ns | 1359913 ns | 0.90 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19500 ns | 21375 ns | 0.91 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19208.5 ns | 20938 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23542 ns | 22583 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20000 ns | 19167 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 111306 ns | 109650 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3589059.5 ns | 3622083 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 91551 ns | 75660 ns | 1.21 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220459 ns | 218833.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 226458 ns | 221084 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 223104.5 ns | 235688 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219708 ns | 221125.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 714110 ns | 709252 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26626181 ns | 25088612.5 ns | 1.06 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 487481 ns | 553695 ns | 0.88 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 500 ns | 1.25 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23491 ns | 23372.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1232519 ns | 1180770.5 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 43771 ns | 49900 ns | 0.88 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9417 ns | 9874.5 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9291.5 ns | 9708 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9708 ns | 10229.5 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9646 ns | 9334 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 261581 ns | 259739 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 23734390 ns | 25804898 ns | 0.92 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 381618 ns | 401304 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8917 ns | 9541 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7583 ns | 9187.5 ns | 0.83 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11854.5 ns | 10833 ns | 1.09 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9042 ns | 8875 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 115935.5 ns | 113457.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3441325 ns | 3378008 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 70456.5 ns | 69850 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8125 ns | 7625 ns | 1.07 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7542 ns | 7937.5 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8000 ns | 8375 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7292 ns | 7541.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 484010 ns | 474853 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17813154.5 ns | 17576598 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 302215 ns | 322123 ns | 0.94 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1417 ns | 1500 ns | 0.94 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1667 ns | 1666.5 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1959 ns | 2187.5 ns | 0.90 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1500 ns | 1542 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 20030 ns | 19317 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1146657 ns | 1172938.5 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 184144 ns | 192092 ns | 0.96 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3708 ns | 3542 ns | 1.05 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3625 ns | 3625 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3833 ns | 3833 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4917 ns | 3500 ns | 1.40 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 213101.5 ns | 209093.5 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10511562.5 ns | 10006581.5 ns | 1.05 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 524324.5 ns | 581056 ns | 0.90 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148729 ns | 148416 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 128917 ns | 127541.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 129917 ns | 107500 ns | 1.21 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 235541 ns | 225042 ns | 1.05 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 22778 ns | 22459 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1179919.5 ns | 1201113 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 46868 ns | 37415.5 ns | 1.25 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 143645.5 ns | 143666.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 130875 ns | 110916 ns | 1.18 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 138417 ns | 100875 ns | 1.37 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 290021 ns | 250834 ns | 1.16 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 211960 ns | 206476 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10741797 ns | 10778609 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 223578 ns | 220822 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7334 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5958 ns | 6000 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5958.5 ns | 5375 ns | 1.11 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10041 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33236 ns | 33038 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1203805 ns | 1161067.5 ns | 1.04 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 57207 ns | 48271 ns | 1.19 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221249.5 ns | 220021 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 238542 ns | 227708 ns | 1.05 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 264500 ns | 243333 ns | 1.09 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213250 ns | 212750 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 259447 ns | 256906 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27707385 ns | 27263274.5 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 530542 ns | 522055 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 13209 ns | 12333 ns | 1.07 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12166 ns | 13020.5 ns | 0.93 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13584 ns | 14333.5 ns | 0.95 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12667 ns | 12917 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 135078 ns | 131126.5 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5685986 ns | 5521631 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 227730.5 ns | 235402 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23917 ns | 24520.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24083.5 ns | 24187 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24750 ns | 25354.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 30146 ns | 23625 ns | 1.28 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 833527 ns | 816371.5 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39963084.5 ns | 39369345 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 615374.5 ns | 684572 ns | 0.90 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9271 ns | 9208 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9541 ns | 10042 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10375 ns | 11167 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9250 ns | 9625 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 119628 ns | 116949.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3356719.5 ns | 3478536 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 74940 ns | 70201 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14041 ns | 14250 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13958 ns | 13771 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14750 ns | 15416 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13459 ns | 13958 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 638262 ns | 627909.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 22466836 ns | 21438120 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 344824 ns | 377354 ns | 0.91 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9666.5 ns | 8958 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9208 ns | 10437.5 ns | 0.88 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10959 ns | 11750 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9083.5 ns | 9166 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 118521 ns | 115964 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3571671.5 ns | 3401614.5 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 79399 ns | 72371 ns | 1.10 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13416 ns | 13208 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12416 ns | 12854 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13479.5 ns | 13958 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12708 ns | 12416 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 530027 ns | 516349.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19360325 ns | 19477250 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 317163 ns | 339683.5 ns | 0.93 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 30896 ns | 30291.5 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 33813 ns | 34041.5 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 32249.5 ns | 30042 ns | 1.07 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1875 ns | 2083 ns | 0.90 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16425 ns | 16187 ns | 1.01 |
| batchedmm(2, Bsize=128)/forward/GPU/oneAPI | 76985679 ns | 75928615 ns | 1.01 |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 76663 ns | 78561 ns | 0.98 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5417 ns | 5291.5 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5000 ns | 5499.5 ns | 0.91 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5479.5 ns | 5375 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6270.5 ns | 6375 ns | 0.98 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 138278 ns | 135964 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/GPU/oneAPI | 109824422.5 ns | 110752109 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 340566 ns | 382864 ns | 0.89 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 291 ns | 1.14 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 417 ns | 0.90 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 25574 ns | 24855 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1142450 ns | 1239551 ns | 0.92 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 45666 ns | 48910 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6458 ns | 6459 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 6604 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6791.5 ns | 7208.5 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6458.5 ns | 6125 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 185923.5 ns | 180794 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22900684.5 ns | 24106911.5 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 365402.5 ns | 390139 ns | 0.94 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2084 ns | 2000 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2084 ns | 2125 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2083 ns | 2125 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2000 ns | 2042 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 26453 ns | 25818 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1207656 ns | 1193002 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 203645.5 ns | 219547 ns | 0.93 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 18041 ns | 17500.5 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17166.5 ns | 17833.5 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17750 ns | 18437.5 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 23458.5 ns | 17500 ns | 1.34 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 268326 ns | 264425 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24994377.5 ns | 24505308 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 600702.5 ns | 705652 ns | 0.85 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 147875 ns | 178208 ns | 0.83 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 155437.5 ns | 165145.5 ns | 0.94 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 155125 ns | 179042 ns | 0.87 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 151708 ns | 151292 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190890.5 ns | 187400 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7974634 ns | 7801096 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 271146.5 ns | 191502 ns | 1.42 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1321937.5 ns | 1317104 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1330625 ns | 1320125 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1308375 ns | 1331937 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1285166 ns | 1318125.5 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 867140 ns | 859849 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 45331705.5 ns | 43918638 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1006962 ns | 1005140 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25500 ns | 24084 ns | 1.06 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23542 ns | 24708 ns | 0.95 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28708.5 ns | 28063 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24416.5 ns | 26291.5 ns | 0.93 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 226899 ns | 226248 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7680667 ns | 8086333 ns | 0.95 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 128029 ns | 115141 ns | 1.11 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 125062.5 ns | 160416.5 ns | 0.78 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 165729.5 ns | 132958 ns | 1.25 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 125854.5 ns | 127937.5 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 180062 ns | 124437.5 ns | 1.45 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 998018.5 ns | 978646 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44411227 ns | 45755327 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 568743 ns | 587856 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23453 ns | 22971 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1190116 ns | 1181802 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 44533 ns | 48630 ns | 0.92 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6895.5 ns | 6333 ns | 1.09 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 6729.5 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6958 ns | 7291 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6520.5 ns | 6541.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 201834 ns | 197400 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23542895 ns | 24703832 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 372536 ns | 392804 ns | 0.95 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5645.5 ns | 5584 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5375 ns | 6958 ns | 0.77 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7979 ns | 8021 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5166 ns | 5875 ns | 0.88 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 139838.5 ns | 135487.5 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5619575.5 ns | 5687030 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 229750 ns | 235072 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9958 ns | 10083.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10042 ns | 10458.5 ns | 0.96 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10417 ns | 10500 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10854.5 ns | 9833.5 ns | 1.10 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 866511 ns | 841087.5 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 43130156 ns | 41023608 ns | 1.05 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 603858 ns | 675251.5 ns | 0.89 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 708 ns | 708 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 708 ns | 667 ns | 1.06 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 750 ns | 750 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 708 ns | 0.94 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22827 ns | 22206 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2079377 ns | 2616381.5 ns | 0.79 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 202368 ns | 209832 ns | 0.96 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4834 ns | 4875 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4833 ns | 4917 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5125 ns | 5208 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6291 ns | 4875 ns | 1.29 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 222098 ns | 215367.5 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9952955 ns | 11863776 ns | 0.84 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 471721 ns | 591926 ns | 0.80 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8750 ns | 7729.5 ns | 1.13 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7834 ns | 7958 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9375 ns | 9833 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7646 ns | 7750.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 117939.5 ns | 115622 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3568146 ns | 3536818 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 74409 ns | 71851 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8792 ns | 8542 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8583 ns | 8687.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8875 ns | 9520.5 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8083 ns | 8520.5 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 568724.5 ns | 552863 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 20842961 ns | 20606879 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 335106 ns | 343673.5 ns | 0.98 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 126042 ns | 127854 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 129208 ns | 128834 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 129542 ns | 96354 ns | 1.34 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 180792 ns | 183167 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46423 ns | 45982 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/GPU/oneAPI | 72616088 ns | 72286847 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 101850 ns | 95811 ns | 1.06 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 315875 ns | 330459 ns | 0.96 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 334166.5 ns | 332334 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 323291.5 ns | 197417 ns | 1.64 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 609395.5 ns | 571042 ns | 1.07 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 187684 ns | 183822.5 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/GPU/oneAPI | 93899553 ns | 93731117 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 405833.5 ns | 473290 ns | 0.86 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397500 ns | 397125 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287979.5 ns | 288229 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288375 ns | 215375 ns | 1.34 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756000 ns | 756375 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43964 ns | 43348 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1424885 ns | 1384285 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 79439 ns | 79971 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1461000 ns | 1459375 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1133834 ns | 1132396 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1129645.5 ns | 862770.5 ns | 1.31 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2449292 ns | 2442500 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 254140 ns | 239777 ns | 1.06 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 11042616 ns | 13231788 ns | 0.83 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 254646 ns | 351138.5 ns | 0.73 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 626500 ns | 647458 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 657208.5 ns | 649666 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 649750.5 ns | 655021 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 642417 ns | 641583.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 185720.5 ns | 178508 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8332264.5 ns | 8381344 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 264649 ns | 240322 ns | 1.10 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2452625 ns | 2454875 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2465208.5 ns | 2450333 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2459375 ns | 2461646 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2376375 ns | 2458334 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 949649 ns | 938639.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 53455476.5 ns | 52014786 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1323598 ns | 1448719 ns | 0.91 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 32458 ns | 33146 ns | 0.98
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 36521 ns | 35708 ns | 1.02
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34833 ns | 32000 ns | 1.09
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 959 ns | 875 ns | 1.10
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15902 ns | 15683 ns | 1.01
batchedmm(2, Bsize=32)/forward/GPU/oneAPI | 73782106 ns | 73122838 ns | 1.01
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 74499.5 ns | 71645.5 ns | 1.04
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3125 ns | 3187.5 ns | 0.98
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3250 ns | 3458 ns | 0.94
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3375 ns | 3541 ns | 0.95
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3062.5 ns | 3083 ns | 0.99
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 137187.5 ns | 134592 ns | 1.02
batchedmm(2, Bsize=32)/zygote/GPU/oneAPI | 98822060.5 ns | 97284653 ns | 1.02
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 314258 ns | 337323.5 ns | 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 436500 ns | 439375 ns | 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 438625 ns | 440583 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 438791 ns | 431375 ns | 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 445917 ns | 450375 ns | 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 42826 ns | 42224 ns | 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1503651 ns | 1392161 ns | 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 374379.5 ns | 237893 ns | 1.57
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4140000 ns | 4138958 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4271375 ns | 4247291.5 ns | 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4270687.5 ns | 4262792 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5468750 ns | 4028416.5 ns | 1.36
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 236201.5 ns | 233746 ns | 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 36248116 ns | 36534446 ns | 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1135862 ns | 1234322 ns | 0.92
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3709 ns | 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3791 ns | 3917 ns | 0.97
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3958 ns | 0.95
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3916 ns | 0.95
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 34158 ns | 34090 ns | 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1274307 ns | 1239089 ns | 1.03
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 41117 ns | 40520 ns | 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15375 ns | 15291 ns | 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15334 ns | 15958 ns | 0.96
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15500 ns | 15750 ns | 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15250 ns | 15667 ns | 0.97
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 255579 ns | 251120.5 ns | 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 8309435 ns | 8891050 ns | 0.93
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 158606 ns | 171192 ns | 0.93
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404792 ns | 404125 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 295917 ns | 295250 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 295958 ns | 220625 ns | 1.34
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 759750 ns | 760666 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113245 ns | 113428 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI | 1043498 ns | 1051037 ns | 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 91962 ns | 89110.5 ns | 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1482854 ns | 1479125 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1158625 ns | 1156270.5 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1150334 ns | 886792 ns | 1.30
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2466708 ns | 2464333 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 236768.5 ns | 227639.5 ns | 1.04
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI | 9725420.5 ns | 12228324 ns | 0.80
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 298578 ns | 352474 ns | 0.85
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 584 ns | 500 ns | 1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 584 ns | 625 ns | 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 25569 ns | 24868 ns | 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1198679 ns | 1263047 ns | 0.95
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 202679 ns | 214292 ns | 0.95
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8083 ns | 7541 ns | 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7792 ns | 7917 ns | 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8375 ns | 8250 ns | 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8437.5 ns | 7667 ns | 1.10
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 207068.5 ns | 202491 ns | 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25228707 ns | 25565257 ns | 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 593474 ns | 687187 ns | 0.86
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 829375 ns | 830417 ns | 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 617667 ns | 617334 ns | 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 618667 ns | 467125 ns | 1.32
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1544417 ns | 1539875 ns | 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 130866 ns | 130469 ns | 1.00
batchedmm(128, Bsize=32)/forward/GPU/oneAPI | 74874331.5 ns | 74138060 ns | 1.01
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 211214 ns | 167662 ns | 1.26
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2686104.5 ns | 2680895.5 ns | 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1994542 ns | 1979750 ns | 1.01
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1998375 ns | 1532167 ns | 1.30
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4960479 ns | 4935708 ns | 1.01
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 234509 ns | 233179 ns | 1.01
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI | 102181218 ns | 101283369 ns | 1.01
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 831293.5 ns | 855698 ns | 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32562 ns | 31956 ns | 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1276503 ns | 1162026.5 ns | 1.10
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 48691 ns | 49090 ns | 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6333 ns | 6187 ns | 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 6770.5 ns | 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6667 ns | 7042 ns | 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6104.5 ns | 6375 ns | 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 227701 ns | 217529.5 ns | 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21756022 ns | 22613407 ns | 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 346728 ns | 355723.5 ns | 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1760625 ns | 1750042 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1749875 ns | 1774250 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1744292 ns | 1759417 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1755166 ns | 1775625 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 189332 ns | 177451 ns | 1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7765672 ns | 8059544 ns | 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 413433 ns | 355403 ns | 1.16
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4360416 ns | 4352125 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4366917 ns | 4360770.5 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4349104 ns | 4377083.5 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5705104 ns | 4357583 ns | 1.31
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 849205 ns | 843625 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 48802559 ns | 47645217 ns | 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1205562.5 ns | 1390698 ns | 0.87
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 9604 ns | 14562.5 ns | 0.66
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6916 ns | 9667 ns | 0.72
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 8208 ns | 8292 ns | 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6854 ns | 6666.5 ns | 1.03
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 22924.5 ns | 22207 ns | 1.03
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI | 1184238.5 ns | 1231018 ns | 0.96
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 46437 ns | 37720 ns | 1.23
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 50604.5 ns | 64458.5 ns | 0.79
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 52166 ns | 70792 ns | 0.74
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 45458.5 ns | 45708 ns | 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 33312.5 ns | 49521 ns | 0.67
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 211538 ns | 204835 ns | 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10576796.5 ns | 10627124.5 ns | 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 226508 ns | 233202 ns | 0.97
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 21646 ns | 21292 ns | 1.02
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 26083.5 ns | 24770.5 ns | 1.05
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 24958.5 ns | 22334 ns | 1.12
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5291.5 ns | 7416 ns | 0.71
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18121 ns | 17630 ns | 1.03
batchedmm(2, Bsize=512)/forward/GPU/oneAPI | 88732630 ns | 87889435 ns | 1.01
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 73668 ns | 90301 ns | 0.82
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 12125 ns | 12187 ns | 0.99
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 10667 ns | 10625 ns | 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 10833 ns | 9750 ns | 1.11
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 18042 ns | 18041.5 ns | 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 221707 ns | 216733.5 ns | 1.02
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI | 148404121 ns | 151365483 ns | 0.98
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 322703 ns | 384574 ns | 0.84
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 405917 ns | 405417 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 296791.5 ns | 297333 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 297167 ns | 223417 ns | 1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 756709 ns | 762625 ns | 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46696 ns | 46368 ns | 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI | 1393570.5 ns | 1390104 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 90770 ns | 90091 ns | 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1487375 ns | 1487792 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1163500 ns | 1159187.5 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1157209 ns | 892375 ns | 1.30
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2472417 ns | 2470895.5 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 283340.5 ns | 267416 ns | 1.06
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI | 11947586 ns | 13880635 ns | 0.86
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 269032 ns | 378473.5 ns | 0.71
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 436458 ns | 436458 ns | 1
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 443270.5 ns | 438916 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 440750 ns | 431708 ns | 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 449000 ns | 450167 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 53940 ns | 53539 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1027722 ns | 1016245 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 323133 ns | 235682 ns | 1.37
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4138541 ns | 4143041 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4268354.5 ns | 4257999.5 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4258750 ns | 4266292 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5475229.5 ns | 4032437.5 ns | 1.36
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 255597 ns | 253837 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31502698.5 ns | 31122046.5 ns | 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1132896.5 ns | 1206682 ns | 0.94
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9333 ns | 9208 ns | 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 8000 ns | 8167 ns | 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 8000 ns | 7208 ns | 1.11
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 13250 ns | 13416 ns | 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 23885 ns | 23370 ns | 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1973050 ns | 2190811 ns | 0.90
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 202528 ns | 212852 ns | 0.95
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 49625 ns | 49416 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 49667 ns | 50083 ns | 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 49583 ns | 49541 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 71667 ns | 49667 ns | 1.44
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 336641 ns | 331181 ns | 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 13058534 ns | 12227793 ns | 1.07
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 508895.5 ns | 657676 ns | 0.77
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 108270.5 ns | 123458 ns | 0.88
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 86167 ns | 85271 ns | 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86500 ns | 127292 ns | 0.68
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 146083 ns | 108541.5 ns | 1.35
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192063 ns | 191180.5 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5750624 ns | 6110005 ns | 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 267851 ns | 200667 ns | 1.33
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2018917 ns | 2014999.5 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2016937.5 ns | 1877583 ns | 1.07
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2011375 ns | 2016083 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2024000.5 ns | 2015916 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 511598 ns | 510301 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 30563079 ns | 27606531 ns | 1.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 860237 ns | 943229 ns | 0.91
This comment was automatically generated by workflow using github-action-benchmark.