This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: urgent patch for reactant breakage
Showing 3 changed files with 5 additions and 5 deletions.
@@ -1,7 +1,7 @@
 name = "LuxLib"
 uuid = "82251201-b29d-42c6-8e01-566dec8acb11"
 authors = ["Avik Pal <[email protected]> and contributors"]
-version = "1.3.1"
+version = "1.3.2"

 [deps]
 ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9"
@@ -71,7 +71,7 @@ LinearAlgebra = "1.10"
 LoopVectorization = "0.12.171"
 LuxCore = "1"
 MKL = "0.7"
-MLDataDevices = "1.1.1"
+MLDataDevices = "1.2"
 Markdown = "1.10"
 NNlib = "0.9.24"
 Octavian = "0.3.28"
e6dd65c
@JuliaRegistrator register
e6dd65c
Registration pull request created: JuliaRegistries/General/116561
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:
e6dd65c
LuxLib Benchmarks
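Each benchmark entry in this section carries two timings in nanoseconds followed by a ratio. The ratio agrees with the first timing divided by the second, rounded to two decimals (for example, 6104.5 / 5291 ≈ 1.15). The sketch below reproduces that arithmetic for three entries taken from this section. The function names and the 10% tolerance are illustrative assumptions, not part of the benchmark tooling, and the page does not say which timing belongs to which commit.

```python
def ratio(first_ns: float, second_ns: float) -> float:
    """Return first timing divided by second timing, rounded to two decimals."""
    return round(first_ns / second_ns, 2)


def label(r: float, tolerance: float = 0.10) -> str:
    """Classify a ratio; the 10% tolerance is a hypothetical choice."""
    if r > 1 + tolerance:
        return "first timing slower"
    if r < 1 - tolerance:
        return "first timing faster"
    return "within tolerance"


# (benchmark name, first timing in ns, second timing in ns), copied from this section
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 6104.5, 5291.0),
    ("batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)", 3225666.0, 6375333.5),
    ("dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)", 215375.0, 288250.0),
]

for name, first_ns, second_ns in rows:
    r = ratio(first_ns, second_ns)
    print(f"{name}: ratio={r} ({label(r)})")
```

Many of the timings here are a few microseconds or less, so ratios near 1 on those entries may reflect measurement noise more than a real change.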
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
6104.5
ns5291
ns1.15
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
6125
ns7375
ns0.83
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
7166
ns7687
ns0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
6042
ns6958
ns0.87
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
105660
ns111876
ns0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI
2718405
ns2746993
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
401954
ns414534
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
9979
ns10041.5
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10000
ns10125
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10125
ns10167
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10063
ns10000.5
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
495391
ns497187
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI
16604973
ns17740695
ns0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
682487
ns664206
ns1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1812
ns1479.5
ns1.22
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1708
ns1459
ns1.17
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1667
ns1875
ns0.89
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
2104
ns1583.5
ns1.33
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
20067
ns19698
ns1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI
1353411.5
ns1364290
ns0.99
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU
31000
ns31630
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4041
ns4083
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
3625
ns4416
ns0.82
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4542
ns4125
ns1.10
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4250.5
ns3291
ns1.29
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
133056
ns130509
ns1.02
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI
8127591.5
ns9003854
ns0.90
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU
146031
ns149371
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
58042
ns57958
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
39959
ns46167
ns0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
39792
ns46542
ns0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83333
ns82541
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36918.5
ns36502
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
567954
ns564405
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
76900
ns81146
ns0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2030417
ns2037625
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2081666.5
ns2078416
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2084437
ns2083625
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2002333
ns2000875
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
220443
ns216924
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
7915184
ns7524779
ns1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1433294
ns1725786
ns0.83
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
146500
ns152667
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
164208.5
ns168375
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
150937.5
ns152437.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
189709
ns193708
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166381.5
ns167125
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
8085655
ns7312313
ns1.11
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
187972
ns213517
ns0.88
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1113437
ns1113104.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1109375
ns1116334
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1117083.5
ns1115000
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1112084
ns1106770.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
646028
ns628256
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
34072899
ns32195104
ns1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1026270
ns1026645
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
6250.5
ns5166
ns1.21
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4917
ns4792
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5562.5
ns5917
ns0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4708
ns4542
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
82687
ns82840
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI
5327028.5
ns5343488
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
59005.5
ns67740
ns0.87
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8958
ns8708
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8833
ns8500
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9167
ns8708.5
ns1.05
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8875
ns8542
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
554954
ns548688
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI
33304840
ns33264338
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
384224
ns384004
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
18208
ns17709
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22250
ns18625
ns1.19
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
20500
ns21375
ns0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
17833.5
ns19583.5
ns0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
62129
ns61770.5
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3094668.5
ns3180292.5
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
77001
ns75881
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
234334
ns212208
ns1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
229500
ns219208.5
ns1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
224000
ns214875
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
219041.5
ns219958
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
329979.5
ns324445
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
12487876
ns13687318.5
ns0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
465894
ns466224
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
584
ns625
ns0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
708
ns625
ns1.13
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
750
ns958
ns0.78
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
645.5
ns667
ns0.97
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
19107
ns18677
ns1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI
1159471
ns1223151.5
ns0.95
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU
32171
ns31400
ns1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1458
ns1416.5
ns1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1334
ns1417
ns0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1542
ns1583
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1375
ns1416
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
114910.5
ns114301
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI
8672258
ns8986516.5
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU
124841
ns135771
ns0.92
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7417
ns7292
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5354.5
ns6125
ns0.87
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5458
ns6125
ns0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10042
ns9958
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23654
ns23537.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1263771.5
ns1250837
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
48941
ns47255.5
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
256833
ns220688
ns1.16
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
269917
ns235896
ns1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
269000
ns229416
ns1.17
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
213417
ns255458.5
ns0.84
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
184585
ns180772.5
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
32350348
ns30816816.5
ns1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
588346
ns642475
ns0.92
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4084
ns4042
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4084
ns4084
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4125
ns4166
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4083
ns4083
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23536
ns22833
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI
2170016
ns2018204
ns1.08
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU
47570
ns46910
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16500
ns16541
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16667
ns16834
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17042
ns17084
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16500
ns16833
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
185621
ns182565
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI
9999690
ns10544759
ns0.95
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU
171902
ns171221
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
493500
ns493041
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
313000
ns385667
ns0.81
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
312583
ns386125
ns0.81
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
847333
ns847250
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113322
ns112997
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI
396951
ns408156.5
ns0.97
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU
242543
ns242212
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2121250
ns2093437.5
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1582666
ns1861958
ns0.85
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1584000
ns1876833
ns0.84
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3043250.5
ns3143021
ns0.97
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
230454
ns228687
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI
10440097.5
ns10334254.5
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
746137
ns743867
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
7000.5
ns6167
ns1.14
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6479.5
ns6625
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
6708
ns8333.5
ns0.80
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6458
ns7375
ns0.88
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
83715.5
ns83073.5
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI
5295782.5
ns5613807.5
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
59480
ns65621
ns0.91
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12396
ns11042
ns1.12
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11500
ns10958.5
ns1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12104.5
ns11645.5
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11333.5
ns12812.5
ns0.88
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
600141.5
ns595390
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI
39003334
ns37940094
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
410324
ns408370.5
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns541
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
541
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
541
ns542
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23331
ns23168
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI
2292659.5
ns2210796
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU
51010
ns46950
ns1.09
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2125
ns2084
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2084
ns2125
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns2209
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2166
ns2084
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
233774
ns213006.5
ns1.10
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI
11024589.5
ns11081491
ns0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU
182892
ns181582.5
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
8417
ns8834
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
9563
ns8834
ns1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
10021
ns10021.5
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8583
ns8500
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
110268
ns99705.5
ns1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
3046756
ns3198646
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
71861
ns72221
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
18042
ns18834
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
18416.5
ns17479
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
19083.5
ns18458
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18187.5
ns18895.5
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
612118
ns566743
ns1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
16020202
ns18116750
ns0.88
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
379663
ns377315
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
542
ns500
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
583
ns584
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
500
ns541
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
34018
ns33362
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI
1157894
ns1254721
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
48210
ns46210
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9000
ns8916.5
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9250
ns9479.5
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9541.5
ns9667
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9187.5
ns9250
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
263691
ns255341
ns1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI
17973778.5
ns18499322
ns0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
363818.5
ns366854.5
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
399291
ns398042
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
215375
ns288250
ns0.75
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
215291
ns288042
ns0.75
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
756375
ns755958
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111229
ns112430
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI
340432
ns338275
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU
74750
ns74831
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1397958
ns1408834
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
860270.5
ns1134937.5
ns0.76
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
859500
ns1133167
ns0.76
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2356875
ns2438875
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
199160
ns198896
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI
10127007.5
ns10071273
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU
325203
ns320874
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7458.5
ns7270.5
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7583
ns7583
ns1
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8250
ns8854.5
ns0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7188
ns6917
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
138757.5
ns136778.5
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
5504110
ns5388548.5
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
59831
ns66211
ns0.90
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
12708.5
ns14833.5
ns0.86
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
16250
ns15791
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16708
ns14229.5
ns1.17
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
12250
ns14709
ns0.83
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
903568
ns897273
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
41185117.5
ns42742507.5
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
426569.5
ns425150
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
25146
ns27562.5
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
29875
ns25583
ns1.17
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
29563
ns30228.5
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
28708
ns28854
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
186563
ns185009
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7437253
ns7764877.5
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
112512
ns115321
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
158917
ns106417
ns1.49
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
155729
ns151645.5
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
147416.5
ns153166
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
143875
ns150583
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1016648
ns996849
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
42077608
ns42717639
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
580615
ns587287
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
74583
ns74375
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
75291
ns75999.5
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
84145.5
ns80375
ns1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
80750
ns87000
ns0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
192007
ns189182
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7668059
ns7755648
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
121601
ns128191.5
ns0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
303292
ns295291
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
318458
ns319708
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
310583.5
ns247791.5
ns1.25
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
286500
ns273792
ns1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1028367
ns1010749
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
39874880
ns41903716
ns0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
694997
ns697424
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
13208
ns13500
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
13209
ns13209
ns1
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
14416.5
ns14167
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12583
ns13125
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
137690
ns136045
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
5469791
ns5580516
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
235293
ns233743
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25916.5
ns26604
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26042
ns26187.5
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27125
ns26625
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
27750
ns28208.5
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
917440.5
ns900814
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
41204925.5
ns41017058
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
677137
ns690428
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11021.5
ns12000
ns0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12104
ns11896
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12667
ns12459
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11084
ns10833
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
118805.5
ns117378.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
3655788
ns3507903
ns1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
238257.5
ns236503
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
22625
ns22417
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
23354.5
ns22958
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
23500
ns23875
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
23125
ns22583
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
678428
ns660570
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
21333771
ns20618992
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
679757
ns675828
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
66333
ns64666
ns1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
64583.5
ns67083.5
ns0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
68500
ns66167
ns1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
64792
ns66875
ns0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
101302
ns100086
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3333715
ns3307399.5
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
234893
ns234107.5
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
486625
ns465000
ns1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
486083
ns466167
ns1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
478646
ns468625
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
464625
ns503833
ns0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
490708
ns483663
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
20883033
ns21199224
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
709767
ns709238
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7562.5
ns7562.5
ns1
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7875
ns8083
ns0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8500
ns8250
ns1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7292
ns7083.5
ns1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
136584.5
ns134375
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI
5472736.5
ns5976128
ns0.92
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
57580
ns65651
ns0.88
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14459
ns14041.5
ns1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14417
ns13125
ns1.10
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14625
ns14479
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16625
ns14625
ns1.14
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
882666
ns872555
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI
38063609
ns40293875
ns0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
396884
ns400284
ns0.99
| batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6159458 ns | 6157812.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 3225666 ns | 6375333.5 ns | 0.51 |
| batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 3225333 ns | 6376917 ns | 0.51 |
| batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11918958 ns | 11913125 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/GPU/CUDA | 345241.5 ns | 346601.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/GPU/oneAPI | 49786188 ns | 53593217 ns | 0.93 |
| batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 301508 ns | 320474 ns | 0.94 |
| batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19144854.5 ns | 19110896 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 11111958.5 ns | 19977396 ns | 0.56 |
| batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 11126458 ns | 19903104 ns | 0.56 |
| batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36537562.5 ns | 36496187.5 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1009913 ns | 1012562 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/GPU/oneAPI | 79258291 ns | 77852170.5 ns | 1.02 |
| batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1164436.5 ns | 1157544 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1125 ns | 1000 ns | 1.13 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1042 ns | 1042 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1041 ns | 958 ns | 1.09 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23469 ns | 22944 ns | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2016774.5 ns | 2044697 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 209702 ns | 207642 ns | 1.01 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4000 ns | 3917 ns | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4000 ns | 4000 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4000 ns | 4042 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4000 ns | 4000 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 270402 ns | 269119 ns | 1.00 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 12755226.5 ns | 11661739 ns | 1.09 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 624936 ns | 625997 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7896 ns | 8437.5 ns | 0.94 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7624.5 ns | 8895.5 ns | 0.86 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9041 ns | 9562 ns | 0.95 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8792 ns | 8375 ns | 1.05 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 116551 ns | 113535 ns | 1.03 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3381124 ns | 3443497.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 67301 ns | 68271 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 12375 ns | 11896 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 12354.5 ns | 12021 ns | 1.03 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 13458 ns | 12792 ns | 1.05 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11521 ns | 12583 ns | 0.92 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 608379 ns | 597497 ns | 1.02 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23562453 ns | 21602127 ns | 1.09 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 355544 ns | 354444 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 375 ns | 291 ns | 1.29 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22683 ns | 22361 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2066903.5 ns | 1916584 ns | 1.08 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 48621 ns | 46890 ns | 1.04 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2917 ns | 3000 ns | 0.97 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 3000 ns | 2958 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3458 ns | 3333 ns | 1.04 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2917 ns | 2917 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 194883.5 ns | 193738 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9192914 ns | 9462333 ns | 0.97 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 160881 ns | 156212 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11833 ns | 10583 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11771 ns | 11875 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12666 ns | 13083.5 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11708 ns | 12062.5 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 114987 ns | 113976.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3433103 ns | 3275659 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 237082 ns | 236063 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 22270.5 ns | 21833.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23625 ns | 22145.5 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23145.5 ns | 23875 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 22417 ns | 22333 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 559620 ns | 547934 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 22188886 ns | 20491745 ns | 1.08 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 657467.5 ns | 654438 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4417 ns | 4375 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4417 ns | 4458 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4375 ns | 4417 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4375 ns | 4417 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 23954 ns | 23860 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2173917 ns | 2144860.5 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 47821 ns | 49061 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16375 ns | 16375 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16375 ns | 16666 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16500 ns | 16666 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16250 ns | 16541 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 319321 ns | 316685 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12527061 ns | 12062386.5 ns | 1.04 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 205182 ns | 209243 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2209 ns | 2000 ns | 1.10 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2208 ns | 2084 ns | 1.06 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2209 ns | 2209 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2084 ns | 2208 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34739 ns | 34477 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1189190 ns | 1229094 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 207283 ns | 203202 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 17729.5 ns | 18604 ns | 0.95 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 19291.5 ns | 18708 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 19125 ns | 18833.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 17500 ns | 21208.5 ns | 0.83 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 284503 ns | 282309 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 23993645 ns | 21098361 ns | 1.14 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 683047 ns | 686013 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 58771 ns | 59292 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 61500 ns | 64917 ns | 0.95 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 62167 ns | 66458 ns | 0.94 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51041 ns | 51625 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66683 ns | 66488 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/oneAPI | 87104215 ns | 88258686 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 96771 ns | 118491 ns | 0.82 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 189875 ns | 175916.5 ns | 1.08 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 148499.5 ns | 153479 ns | 0.97 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 141104 ns | 160333.5 ns | 0.88 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 271312 ns | 224542 ns | 1.21 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 208001 ns | 208290.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/GPU/oneAPI | 151028820 ns | 149475929.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 556366 ns | 608982 ns | 0.91 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 83188 ns | 81083 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 116270.5 ns | 83270.5 ns | 1.40 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 87667 ns | 124833.5 ns | 0.70 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 88791 ns | 85395.5 ns | 1.04 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190555.5 ns | 192029 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5413890 ns | 5900244 ns | 0.92 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 168726.5 ns | 202972.5 ns | 0.83 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1885521 ns | 1881145.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1906833 ns | 1912667 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1922167 ns | 1916083 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1922208.5 ns | 1849250 ns | 1.04 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 505315 ns | 499932 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 25531763 ns | 26802673 ns | 0.95 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 918625.5 ns | 1066872 ns | 0.86 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21748.5 ns | 21422.5 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2173850 ns | 2063314.5 ns | 1.05 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 40920 ns | 41850 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1834 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1834 ns | 1834 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1834 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1875 ns | 0.96 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 243459 ns | 241279.5 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 10131447 ns | 9975087 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 176522 ns | 180262 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 11042 ns | 8166 ns | 1.35 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9834 ns | 10292 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11166.5 ns | 11208 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9417 ns | 11042 ns | 0.85 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 115799 ns | 113299.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3304623 ns | 3500381.5 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 235862 ns | 233333 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9916 ns | 9917 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 11000 ns | 9834 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10437.5 ns | 11458 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9625 ns | 10417 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 492386 ns | 484445 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19770322 ns | 18749564 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 634956.5 ns | 627157 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58666 ns | 58375 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39500 ns | 47209 ns | 0.84 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39333 ns | 46833 ns | 0.84 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83750 ns | 82625 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38435 ns | 38276 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1328229 ns | 1341940 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 79261 ns | 75211 ns | 1.05 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1932333.5 ns | 1836770.5 ns | 1.05 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1949916 ns | 1985937.5 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1971250 ns | 1978479 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1900375 ns | 1854291.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 211772 ns | 209126 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 38871327.5 ns | 33357124 ns | 1.17 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1010796 ns | 1011361 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 276583 ns | 267437.5 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 268541 ns | 270417 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 270583.5 ns | 270625 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 269542 ns | 268604.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 196349 ns | 193011.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7271576 ns | 7986425 ns | 0.91 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 281833 ns | 282544 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 662208 ns | 588125 ns | 1.13 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 709250 ns | 688229.5 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 685042 ns | 688292 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 690770.5 ns | 593500 ns | 1.16 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 994716 ns | 985216 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45571380 ns | 43272459 ns | 1.05 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 902690 ns | 911561 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2181125 ns | 2205542 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2197167 ns | 2194083.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2214166 ns | 2213708 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2217666 ns | 2176167 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 156988.5 ns | 153511 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8339035 ns | 8157200 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 421825 ns | 445380 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5477291.5 ns | 5508979.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5530250 ns | 5521979 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5519334 ns | 5474458 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5543313 ns | 5531895.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 938151 ns | 925959.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 53140221 ns | 50527002 ns | 1.05 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1722729 ns | 1539832.5 ns | 1.12 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 478167 ns | 478666 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 257208 ns | 346145.5 ns | 0.74 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 257292 ns | 346083 ns | 0.74 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 908750 ns | 909333 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46532.5 ns | 46203 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 824635.5 ns | 382606 ns | 2.16 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 246353 ns | 242913 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2133375 ns | 2111749.5 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1588083 ns | 1861166.5 ns | 0.85 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1587417 ns | 1866541 ns | 0.85 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3041125 ns | 3130375 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 256675 ns | 258500 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 12946074 ns | 15052922 ns | 0.86 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 775668 ns | 773039 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58000 ns | 58125 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39625 ns | 46334 ns | 0.86 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39375 ns | 46167 ns | 0.85 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83500 ns | 82542 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 27930.5 ns | 27952 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1395751 ns | 1310631 ns | 1.06 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 73260 ns | 73681 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2017271 ns | 2039458 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2083062.5 ns | 2089729.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2080584 ns | 2087020.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1994312.5 ns | 1978124.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 224353 ns | 221951 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 36423844 ns | 36802380 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1036751 ns | 1041362 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58292 ns | 58417 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39917 ns | 46958 ns | 0.85 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39750 ns | 46417 ns | 0.86 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83458 ns | 82334 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48290 ns | 47697.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 796160 ns | 816463 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 69781 ns | 71371 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1920208 ns | 1926479.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1966666.5 ns | 1973250 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1956354.5 ns | 1973167 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1892750 ns | 1898833 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 231868 ns | 228428 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 17847168 ns | 17513790 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 917180 ns | 1026836.5 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 333 ns | 1.13 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 334 ns | 0.87 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 33423 ns | 33167 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1183579 ns | 1174385.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 47961 ns | 48501 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6750 ns | 6416 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6625 ns | 6834 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6916 ns | 7250 ns | 0.95 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6542 ns | 6333 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 205663 ns | 199581 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20110108 ns | 19880225 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 364303.5 ns | 363764 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 333 ns | 250 ns | 1.33 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 291 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31975 ns | 32517 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1205451 ns | 1265101 ns | 0.95 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 40370 ns | 37771 ns | 1.07 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 3667 ns | 3417 ns | 1.07 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3625 ns | 3375 ns | 1.07 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3209 ns | 3167 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 3250 ns | 2792 ns | 1.16 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 182875 ns | 182053 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 7414501 ns | 9212477.5 ns | 0.80 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 146242 ns | 158127 ns | 0.92 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 468625 ns | 460687.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 492396 ns | 478208.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 470250 ns | 500000 ns | 0.94 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 466354 ns | 470937 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 134348 ns | 134071 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5645574.5 ns | 5855749 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 349229 ns | 366189 ns | 0.95 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4091499.5 ns | 4078667 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4078417 ns | 4067771 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4081499.5 ns | 4080625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4051646 ns | 4056354 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 673570.5 ns | 664164.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33417417 ns | 31731318 ns | 1.05 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1482381 ns | 1467136 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49972812 ns | 49955792 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 26026291 ns | 35488958 ns | 0.73 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 25991500 ns | 35531584 ns | 0.73 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 97072458 ns | 97090437.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1599973.5 ns | 1601101.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/oneAPI | 56493457 ns | 55729446 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1057326.5 ns | 1044391.5 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154932104.5 ns | 154677542 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 89308062.5 ns | 112413020.5 ns | 0.79 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 88895875 ns | 112347584 ns | 0.79 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 295925812.5 ns | 295444937.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6475879 ns | 6489665.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/oneAPI | 126118434 ns | 124609188 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5578679 ns | 5586056.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 18917 ns | 19520.5 ns | 0.97 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 16000 ns | 17458 ns | 0.92 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 13708 ns | 17417 ns | 0.79 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 16437.5 ns | 15750 ns | 1.04 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 19926 ns | 19821 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1208215.5 ns | 1180885 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 27550 ns | 26420 ns | 1.04 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 10937 ns | 10959 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 7770.5 ns | 9125.5 ns | 0.85 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 7708 ns | 9084 ns | 0.85 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17291 ns | 17167 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 243495.5 ns | 242736.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 9761127 ns | 10064346.5 ns | 0.97 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 147112 ns | 149096.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8750 ns | 8208.5 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9708.5 ns | 10687.5 ns | 0.91 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10667 ns | 10604.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8646 ns | 8959 ns | 0.97 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 119480.5 ns | 116211.5 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3419715 ns | 3615906 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 237342 ns | 234342 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10312.5 ns | 10292 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11041 ns | 10209 ns | 1.08 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10667 ns | 11292 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10770.5 ns | 9437.5 ns | 1.14 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 585828 ns | 575593 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23041802 ns | 22955140 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 655982 ns | 652487 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10020.5 ns | 9250 ns | 1.08 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9333 ns | 9875 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10396 ns | 11041.5 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9500 ns | 10333.5 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 115334.5 ns | 113252 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3516489.5 ns | 3518835.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 70430.5 ns | 72611 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 15292 ns | 16542 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 17375 ns | 15833 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 15542 ns | 15750 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 16250 ns | 16708 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 558960.5 ns | 548922 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19523647 ns | 19847936.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 346234 ns | 343724 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 625 ns | 500 ns | 1.25 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 584 ns | 584 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 583 ns | 584 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 33420.5 ns | 33202 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1147462 ns | 1238450.5 ns | 0.93 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 208233 ns | 204092 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8875 ns | 8916 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8917 ns | 9292 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 9958 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8125 ns | 12292 ns | 0.66 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 223663.5 ns | 220226.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20525508 ns | 21905010 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 660067.5 ns | 657387 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 15833 ns | 17625 ns | 0.90 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 14958 ns | 15958 ns | 0.94 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 13166.5 ns | 15209 ns | 0.87 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12042 ns | 11291 ns | 1.07 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 20351 ns | 19970 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1180410 ns | 1162812 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 188642 ns | 188782 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 35334 ns | 35458 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 35396 ns | 35562 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 35354.5 ns | 35645.5 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 35459 ns | 35542 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 258908.5 ns | 255892 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11202018 ns | 10845224.5 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 593676 ns | 591957 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 453584 ns | 448958 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 448854.5 ns | 453750 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 458979 ns | 492166 ns | 0.93 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 463708 ns | 453875 ns | 1.02 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194627 ns | 193846 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5838406 ns | 6007739 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 361629 ns | 367744 ns | 0.98 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4069291 ns | 4069208 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4057666 ns | 4054291.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4066166.5 ns | 4049270.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4041000 ns | 4057500 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 509044 ns | 505408 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 28858353.5 ns | 37330396 ns | 0.77 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1369935 ns | 1362695 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 786136291 ns | 779601166 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 416023146 ns | 542496166 ns | 0.77 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 416822792 ns | 539989666 ns | 0.77 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1513689687.5 ns | 1569938708 ns | 0.96 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22552578.5 ns | 22536712.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/GPU/oneAPI | 184723934 ns | 187859757.5 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14622705 ns | 14732780 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2527797917 ns | 2505560125 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1507508250 ns | 1783555333 ns | 0.85 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1513719042 ns | 1792629375 ns | 0.84 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4744640792 ns | 5216869375 ns | 0.91 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 119636395 ns | 118336848 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/GPU/oneAPI | 977281580 ns | 935397218 ns | 1.04 |
| batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 87882829 ns | 88936600 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 78083.5 ns | 78854.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 79375 ns | 76791 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 79292 ns | 79000 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 77417 ns | 79354 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 195081 ns | 190682.5 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 5838980 ns | 5473671 ns | 1.07 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 106236.5 ns | 108351 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 291584 ns | 294125 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232333.5 ns | 289958 ns | 0.80 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 275646 ns | 261417 ns | 1.05 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 268875 ns | 238520.5 ns | 1.13 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 999623 ns | 986968 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45105321.5 ns | 46526863 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 637827 ns | 636237 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199983542 ns | 199699479 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 103920208 ns | 139060584 ns | 0.75 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 103978083 ns | 139030750 ns | 0.75 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 389299042 ns | 388620875 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5843844.5 ns | 5814292 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/GPU/oneAPI | 83097022 ns | 80005938 ns | 1.04 |
| batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3606828 ns | 3574368 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 620238542 ns | 621021958 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 353393416.5 ns | 439183125 ns | 0.80 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 352881646 ns | 439329875 ns | 0.80 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1193561791 ns | 1194801708 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26518526 ns | 26594102 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/oneAPI | 284158390 ns | 295168041 ns | 0.96 |
| batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 22094133 ns | 22131978 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7291 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5375 ns | 6291 ns | 0.85 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5375 ns | 6125 ns | 0.88 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9875 ns | 9959 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26733.5 ns | 26445 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1222239 ns | 1270170 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 46490 ns | 48281 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220979 ns | 216209 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 224417 ns | 220416.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 223500 ns | 222625 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 207583 ns | 219125 ns | 0.95 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 215495 ns | 214078.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20243976 ns | 29452909 ns | 0.69 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 519876 ns | 522765 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10312.5 ns | 9416.5 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9479 ns | 9041 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9895.5 ns | 9833.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9937.5 ns | 10791.5 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 113347 ns | 110026 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3285929 ns | 3375913.5 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 71090 ns | 72150 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9604 ns | 10854.5 ns | 0.88 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 11437.5 ns | 7750 ns | 1.48 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10042 ns | 9708 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10145.5 ns | 11250 ns | 0.90 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 491382 ns | 484552.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19298925 ns | 18934737 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 314464 ns | 313639 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 708 ns | 459 ns | 1.54 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 709 ns | 500 ns | 1.42 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 542 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 667 ns | 0.81 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24930.5 ns | 24574 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 1200126 ns | 1221655.5 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 48911 ns | 46721 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 12375 ns | 12292 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14958 ns | 8896 ns | 1.68 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9000 ns | 10583 ns | 0.85 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9666 ns | 13666 ns | 0.71 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 246496 ns | 243553 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 26158764.5 ns | 23025269 ns | 1.14 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 386995 ns | 386704 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 110750 ns | 111083 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 90417 ns | 102541.5 ns | 0.88 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 88125 ns | 103792 ns | 0.85 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 155146 ns | 155083.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 23300 ns | 22624 ns | 1.03 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI | 807719.5 ns | 791164 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 190702 ns | 191512 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 534625 ns | 567500 ns | 0.94 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 562249.5 ns | 573417 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 542812.5 ns | 549583.5 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 535250 ns | 537292 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 217557.5 ns | 213930 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11916876 ns | 11700863.5 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 610017 ns | 608337 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5375 ns | 5750 ns | 0.93 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 6709 ns | 5167 ns | 1.30 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 7375 ns | 7667 ns | 0.96 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 6520.5 ns | 4563 ns | 1.43 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17156 ns | 16559 ns | 1.04 |
| batchedmm(16, Bsize=32)/forward/GPU/oneAPI | 73303180 ns | 73950991 ns | 0.99 |
| batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 71171 ns | 80275.5 ns | 0.89 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 12833 ns | 11958 ns | 1.07 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 11375 ns | 10750 ns | 1.06 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 10145.5 ns | 11583 ns | 0.88 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 16708.5 ns | 18167 ns | 0.92 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 204040 ns | 203546 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/GPU/oneAPI | 100040684 ns | 98437217 ns | 1.02 |
| batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 364443 ns | 367244 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 38834 ns | 38958 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 50542 ns | 51125 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 51417 ns | 52458 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13854.5 ns | 13770.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 21940 ns | 20666.5 ns | 1.06 |
| batchedmm(16, Bsize=128)/forward/GPU/oneAPI | 77344300 ns | 77382892 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 84996 ns | 87361 ns | 0.97 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 36917 ns | 36416 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 31042 ns | 30770.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 28125 ns | 35250 ns | 0.80 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 77979.5 ns | 58812.5 ns | 1.33 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 180753 ns | 180756 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/GPU/oneAPI | 114626839.5 ns | 110794943 ns | 1.03 |
| batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 397599 ns | 408754 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1854.5 ns | 1750 ns | 1.06 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1958 ns | 1875 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2209 ns | 2125 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1666.5 ns | 1833 ns | 0.91 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 19375 ns | 19320 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI | 1256628 ns | 1202099 ns | 1.05 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 27490 ns | 33080 ns | 0.83 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2208 ns | 2395.5 ns | 0.92 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2167 ns | 2333 ns | 0.93 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2416 ns | 2375 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2125 ns | 2375 ns | 0.89 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 194356 ns | 193868.5 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI | 9435961.5 ns | 9197836 ns | 1.03 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 136311 ns | 137016.5 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5166.5 ns | 5292 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5520.5 ns | 5750 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6396 ns | 6208 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5187.5 ns | 5958.5 ns | 0.87 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 140899.5 ns | 139204.5 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 5837469 ns | 5728892 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 57270 ns | 69071 ns | 0.83 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9020.5 ns | 8667 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9437.5 ns | 8625 ns | 1.09 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8583 ns | 8792 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8417 ns | 9145.5 ns | 0.92 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 815402.5 ns | 809144 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 41500822.5 ns | 39925856 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 388544 ns | 387074 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 55083 ns | 55125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 54292 ns | 55958 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 54375 ns | 56042 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 56417 ns | 56208 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36794 ns | 35813.5 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1165062 ns | 1246242 ns | 0.93 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 206892 ns | 202752 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 478792 ns | 489125 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 535375 ns | 532541.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 496937 ns | 505645.5 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 474395.5 ns | 470521 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 257604 ns | 253767 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27299962 ns | 26667416 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 810628 ns | 833929 ns | 0.97 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3331771 ns | 3319083.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1763000 ns | 2337292 ns | 0.75 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1769417 ns | 2337917 ns | 0.76 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6317646 ns | 6313500 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 204848.5 ns | 204383 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/oneAPI | 81259709 ns | 80623182 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 209783 ns | 213737 ns | 0.98 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11521375.5 ns | 11497229 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 6550500 ns | 8328208.5 ns | 0.79 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 6561792 ns | 8338541.5 ns | 0.79 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21242604 ns | 21078124.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 741852 ns | 737191.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/GPU/oneAPI | 122682886.5 ns | 126245472 ns | 0.97 |
| batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1060031 ns | 1058001 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6292 ns | 4750 ns | 1.32 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5666 ns | 6875 ns | 0.82 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7042 ns | 6874.5 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5209 ns | 6791.5 ns | 0.77 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 132073.5 ns | 130168 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5592627 ns | 5745155 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 54021 ns | 56791 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10375 ns | 7333 ns | 1.41 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9584 ns | 7312.5 ns | 1.31 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7417 ns | 9042 ns | 0.82 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7667 ns | 7375 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 718413.5 ns | 712471 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 35185062 ns | 35913527 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 375894 ns | 368228.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144542 ns | 150042 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 124479.5 ns | 93750 ns | 1.33 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 101625 ns | 126666 ns | 0.80 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 150583 ns | 97708 ns | 1.54 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 148583.5 ns | 148678 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6180627 ns | 5748457.5 ns | 1.08 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 182281 ns | 203522 ns | 0.90 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2030666.5 ns | 2036375 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2034833.5 ns | 2027000.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2034166.5 ns | 2032104 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2024125 ns | 2023625 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 674148 ns | 663877 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31102723.5 ns | 33751272 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1114502 ns | 1110211 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 32917 ns | 34208 ns | 0.96 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 35208 ns | 36458 ns | 0.97 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 33334 ns | 36083 ns | 0.92 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 645.5 ns | 708 ns | 0.91 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15722 ns | 15530 ns | 1.01 |
| batchedmm(2, Bsize=4)/forward/GPU/oneAPI | 72749276 ns | 73822262.5 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 79041 ns | 78911 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 3208 ns | 2542 ns | 1.26 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 3958 ns | 2833.5 ns | 1.40 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3084 ns | 3500 ns | 0.88 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2333 ns | 2209 ns | 1.06 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 136962.5 ns | 136004.5 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/GPU/oneAPI | 93442257 ns | 93721653 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 340914 ns | 339263 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7250 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5417 ns | 6084 ns | 0.89 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5333 ns | 6083 ns | 0.88 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10208 ns | 10125 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35974 ns | 35188 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1275721 ns | 1207919 ns | 1.06 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50280 ns | 48090 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215209 ns | 244041.5 ns | 0.88 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228896 ns | 227416.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220729.5 ns | 224625 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205917 ns | 216417 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 240303 ns | 236066 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26463549 ns | 28254417 ns | 0.94 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 519340 ns | 573176 ns | 0.91 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 3958 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3958 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21966 ns | 21615 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2150661 ns | 2072507 ns | 1.04 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 42521 ns | 42031 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14709 ns | 14666 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14792 ns | 14958 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14834 ns | 14916.5 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14708 ns | 14917 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 299460 ns | 297040 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 11041214 ns | 11259133 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 188891.5 ns | 188487 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 128584 ns | 120896 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 128208 ns | 103687.5 ns | 1.24 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 106604 ns | 130792 ns | 0.82 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 119354 ns | 100583 ns | 1.19 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132553 ns | 149147 ns | 0.89 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5647153 ns | 6201158 ns | 0.91 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 183902 ns | 204362 ns | 0.90 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924833.5 ns | 1925625 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1932167 ns | 1922584 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1926479 ns | 1924687.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1925542 ns | 1918000 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 662628 ns | 656144 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31206115.5 ns | 29883253.5 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1065881 ns | 1218242.5 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17958 ns | 19042 ns | 0.94 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18625 ns | 21375 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20812 ns | 20375 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19584 ns | 18625 ns | 1.05 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 104706.5 ns | 102936.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3361468 ns | 3227536 ns | 1.04 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81176 ns | 80560.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 217417 ns | 216541.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 265209 ns | 239771 ns | 1.11 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222291 ns | 223709 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222917 ns | 247021 ns | 0.90 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 497576 ns | 493608 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 19289828.5 ns | 19781802 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 466715 ns | 479335 ns | 0.97 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 24687 ns | 24792 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 29083 ns | 30625 ns | 0.95 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 27250 ns | 29334 ns | 0.93 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1417 ns | 1291 ns | 1.10 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16449.5 ns | 15803 ns | 1.04 |
| batchedmm(16, Bsize=4)/forward/GPU/oneAPI | 73049342 ns | 73776659 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 80571 ns | 81571 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4729.5 ns | 4770.5 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5917 ns | 5125 ns | 1.15 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5459 ns | 5396 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4875 ns | 4500 ns | 1.08 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 201398 ns | 200310 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/GPU/oneAPI | 95411769 ns | 93984240 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 373024 ns | 379744 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 223084 ns | 226292 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 223479.5 ns | 223000 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 225458.5 ns | 224167 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 222541 ns | 224042 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 220423 ns | 218225 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7765235 ns | 7741481.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 274373 ns | 274597.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 497687.5 ns | 510000 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 497958 ns | 507812.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 501646 ns | 554604 ns | 0.90 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 507125 ns | 496958 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1033721 ns | 1023666.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42343394 ns | 43947953 ns | 0.96 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 858214 ns | 871339 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20625 ns | 21084 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22500 ns | 19896 ns | 1.13 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21791 ns | 21792 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20042 ns | 20375 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 112240 ns | 110346 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3510862 ns | 3666139 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 77390 ns | 79030 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213084 ns | 212625 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218104.5 ns | 220750 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219292 ns | 216750 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217125 ns | 215292 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 716111 ns | 706350 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 24489616 ns | 24785967 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 532795 ns | 536195 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6708 ns | 6292 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7416 ns | 6500 ns | 1.14 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8166 ns | 8667 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6791 ns | 7083 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 133925.5 ns | 133065.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5628647.5 ns | 5772071 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 65140 ns | 65491 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9709 ns | 10416 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12458 ns | 9750 ns | 1.28 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11125 ns | 10291 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10583 ns | 10959 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 779907 ns | 769680.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 38144455 ns | 39162885 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 379434 ns | 386734 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 5333 ns | 1.36 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5250 ns | 5792 ns | 0.91 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6834 ns | 7416 ns | 0.92 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4917 ns | 6833 ns | 0.72 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 135559.5 ns | 135171 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5624998 ns | 5588783 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56400 ns | 68351 ns | 0.83 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 7666 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7792 ns | 7375 ns | 1.06 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7875 ns | 7646 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7625 ns | 7458 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 742169 ns | 735742.5 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 39269500.5 ns | 40095077.5 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 389854 ns | 390854 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14503334 ns | 14524375 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 7723249.5 ns | 10107916 ns | 0.76 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 7705416.5 ns | 10121667 ns | 0.76 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27810125 ns | 27752459 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 535378 ns | 528552 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/GPU/oneAPI | 95953628 ns | 98194699 ns | 0.98 |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 390439 ns | 402574 ns | 0.97 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46519500 ns | 46533666.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 26614709 ns | 33493083 ns | 0.79 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 26530062.5 ns | 33509291 ns | 0.79 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85657500 ns | 85143125 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2847450.5 ns | 2860259.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/oneAPI | 193094633 ns | 194954753.5 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3284834 ns | 3304313.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 68958 ns | 67916 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 69084 ns | 66833.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 68500 ns | 69271 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 68166 ns | 70875 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 104098 ns | 101002.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3588750.5 ns | 3722652 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 232172 ns | 234933 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 480417 ns | 480708.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 475791 ns | 482250 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 474812.5 ns | 478395.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 481041.5 ns | 469167 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 714971 ns | 703449 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27636818 ns | 28256303 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 793828 ns | 791863.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 750 ns | 583 ns | 1.29 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 584 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32749 ns | 32037 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1171475 ns | 1235389 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 49671 ns | 49701 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9875 ns | 8959 ns | 1.10 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9875 ns | 8667 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 9562.5 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9208 ns | 8250 ns | 1.12 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 282467 ns | 277434 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21597763 ns | 22231771.5 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 373314 ns | 376664 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9708 ns | 9625 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9708 ns | 9708 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9625 ns | 9667 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9666 ns | 9625 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23485 ns | 23071 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2051864 ns | 2068828 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 211472 ns | 210172 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 50208 ns | 50291 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 50042 ns | 50417 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 50709 ns | 50875 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 50209 ns | 50583 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 277646 ns | 274332 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11587139 ns | 12829198 ns | 0.90 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 614117 ns | 609897 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 55291 ns | 55125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 54458 ns | 55875 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 54334 ns | 55917 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 56458 ns | 55917 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28038.5 ns | 27736 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1177501 ns | 1172928.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 206412 ns | 203487.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 479020.5 ns | 530167 ns | 0.90 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 525042 ns | 505208 ns | 1.04 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 499937 ns | 508854.5 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 462667 ns | 467249.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 240355 ns | 234697 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 33623204.5 ns | 33825831.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 838988 ns | 884779 ns | 0.95 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 609500 ns | 652145.5 ns | 0.93 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 661417 ns | 645666 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 659375 ns | 651250 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 653812.5 ns | 641645.5 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192690.5 ns | 186754 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8134102 ns | 8762924.5 ns | 0.93 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 262482 ns | 301613 ns | 0.87 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2226104 ns | 2247709 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2247458 ns | 2260312.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2238104 ns | 2234500 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2244458.5 ns | 2220417 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 927304 ns | 904539 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 49609819 ns | 49644380 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1364114 ns | 1208692 ns | 1.13 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20208 ns | 21125 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22354.5 ns | 23708.5 ns | 0.94 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22167 ns | 22229 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19375 ns | 22083 ns | 0.88 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 109169 ns | 106895 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3430083 ns | 3512219 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 77150.5 ns | 79091 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222958 ns | 265500 ns | 0.84 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220604.5 ns | 222041 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 227521 ns | 232667 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 225417 ns | 258208 ns | 0.87 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 712641 ns | 700202.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26229825 ns | 26916118 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 558770.5 ns | 555335 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 583 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 583 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23081 ns | 22789 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1191881 ns | 1272891.5 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 48321 ns | 48001 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9208.5 ns | 9083 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9250 ns | 9083 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10666 ns | 9938 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9791.5 ns | 9875 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 263338 ns | 259037 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25373530 ns | 24963258.5 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 399114 ns | 395944 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10500 ns | 7875 ns | 1.33 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8770.5 ns | 10062.5 ns | 0.87 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10499.5 ns | 10520.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 10583 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 115864 ns | 113441 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3318539 ns | 3336239 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 68530 ns | 70210 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7917 ns | 7833.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7750 ns | 7625 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 7959 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7875 ns | 7708 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 487126 ns | 472332 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 16968946 ns | 17695066.5 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 322433 ns | 319678 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1708 ns | 1625 ns | 1.05 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1667 ns | 1916 ns | 0.87 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2125 ns | 2020.5 ns | 1.05 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1541 ns | 1583 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 19744 ns | 19708 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1156033 ns | 1164130 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 191542 ns | 189672 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3584 ns | 3584 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3708.5 ns | 3750 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3937.5 ns | 3792 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3625 ns | 3584 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 212174.5 ns | 208603 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10575153 ns | 10416766 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 580786 ns | 583536 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 147562.5 ns | 148916.5 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 106562 ns | 127916 ns | 0.83 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 107333 ns | 130229 ns | 0.82 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225583 ns | 225208 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 23301 ns | 22520 ns | 1.03 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1161742 ns | 1193772.5 ns | 0.97 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 34030 ns | 40501 ns | 0.84 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 160417 ns | 160729.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 87959 ns | 123458 ns | 0.71 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 100250 ns | 114792 ns | 0.87 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 252167 ns | 264249.5 ns | 0.95 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 211748 ns | 208808 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10753069 ns | 10974999 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 214182 ns | 268837.5 ns | 0.80 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7291 ns | 7333 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5333 ns | 5959 ns | 0.89 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5250 ns | 5959 ns | 0.88 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10417 ns | 10042 ns | 1.04 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33560.5 ns | 32455 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1264867.5 ns | 1162105 ns | 1.09 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50310 ns | 50660 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 253958.5 ns | 231563 ns | 1.10 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 253021.5 ns | 235208 ns | 1.08 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 235708 ns | 235250 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212792 ns | 214541.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 260417 ns | 252456 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27289933 ns | 25967285 ns | 1.05 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 524496 ns | 597316 ns | 0.88 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 12375 ns | 12542 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12583 ns | 12833 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13896 ns | 14188 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12792 ns | 13625 ns | 0.94 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 134512.5 ns | 131390 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5610287 ns | 5550478 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 235902 ns | 233213 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23959 ns | 23666 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24479.5 ns | 23229.5 ns | 1.05 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25291 ns | 24625 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24583 ns | 23834 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 831522 ns | 815266 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 40443315 ns | 40999180 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 684542 ns | 686982 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9708 ns | 8667 ns | 1.12 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9917 ns | 10208 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11625 ns | 10729 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9209 ns | 9250 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 120339 ns | 116988 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3402718 ns | 3575157.5 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 72241 ns | 73990 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13750 ns | 14459 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14187.5 ns | 13833 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15083.5 ns | 14395.5 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14084 ns | 14250 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 638601 ns | 625118 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21632295 ns | 21155361 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 363914 ns | 365734 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9208.5 ns | 8959 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10000.5 ns | 9709 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11166 ns | 10854.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10167 ns | 10291 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 118694 ns | 116936.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3312719 ns | 3357605 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72320 ns | 73371 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13208.5 ns | 12458 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13020.5 ns | 12270.5 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13396 ns | 12937.5 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12292 ns | 13291 ns | 0.92 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 529419 ns | 515797 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19314666 ns | 18556708 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 342414 ns | 340984 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 30416.5 ns | 27562 ns | 1.10 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 33666.5 ns | 33833.5 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 30542 ns | 31542 ns | 0.97 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1917 ns | 1750 ns | 1.10 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16576 ns | 16227 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/GPU/oneAPI | 80979304 ns | 78590127 ns | 1.03 |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 77461 ns | 81231 ns | 0.95 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5291.5 ns | 5291.5 ns | 1 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 4896 ns | 4979.5 ns | 0.98 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5291.5 ns | 5416 ns | 0.98 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6417 ns | 6458 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 137601 ns | 135144 ns | 1.02 |
| batchedmm(2, Bsize=128)/zygote/GPU/oneAPI | 111031730 ns | 109913428 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 379919 ns | 370274 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 375 ns | 0.89 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 24898 ns | 23995 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1219018.5 ns | 1260687.5 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 49280 ns | 47300 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6750 ns | 6417 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6500 ns | 6541 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6916.5 ns | 6584 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6667 ns | 6708 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 184245 ns | 180494.5 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22834153.5 ns | 23985458.5 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 386844 ns | 386289 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2125 ns | 2000 ns | 1.06 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2167 ns | 2084 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2084 ns | 2166 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2083 ns | 2125 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 25661 ns | 25421 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1228045 ns | 1185083 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 208752 ns | 206112 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17250 ns | 16896 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17292 ns | 17000 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18584 ns | 18333 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 18416.5 ns | 17292 ns | 1.07 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 269097.5 ns | 264717 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 26227590 ns | 25260741 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 693937 ns | 702657 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 150875 ns | 166125 ns | 0.91 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 177416.5 ns | 177603.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 153625 ns | 148958 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 157791 ns | 148917 ns | 1.06 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 191062 ns | 187074 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7761294 ns | 7946915 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 174992 ns | 226902 ns | 0.77 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1338521 ns | 1327854.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1328479 ns | 1318125 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1328250 ns | 1326521.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1330083.5 ns | 1295625 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 866603 ns | 844331.5 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 46656075.5 ns | 47016714 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1114201.5 ns | 1001545 ns | 1.11 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26208.5 ns | 32583 ns | 0.80 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 29479.5 ns | 26000 ns | 1.13 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27062.5 ns | 26541.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24833 ns | 26124.5 ns | 0.95 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 228889.5 ns | 226484 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7347865 ns | 7837953 ns | 0.94 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 116211 ns | 115201 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 117584 ns | 131875 ns | 0.89 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 140791 ns | 152250 ns | 0.92 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 126021 ns | 153750 ns | 0.82 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 119916.5 ns | 131625 ns | 0.91 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 992184 ns | 970881 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 46564017 ns | 45298380 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 594546 ns | 614061 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 334 ns | 250 ns | 1.34 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 334 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 334 ns | 1.12 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23038 ns | 22483 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1228991.5 ns | 1212860.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 49341 ns | 49500 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6833 ns | 6459 ns | 1.06 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6604 ns | 6417 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7042 ns | 6750 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6791 ns | 6563 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 200303.5 ns | 197111 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25134864 ns | 25776994 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 388994 ns | 390483 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6375 ns | 5750 ns | 1.11 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 6458 ns | 0.91 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7812.5 ns | 6979 ns | 1.12 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6458 ns | 5333 ns | 1.21 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 139406.5 ns | 136326 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5798964 ns | 5775885 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 235513 ns | 234652 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10083.5 ns | 10000 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10167 ns | 10334 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10417 ns | 10500 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9959 ns | 10125 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 853447 ns | 837095 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 40259621 ns | 40912011 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 676147 ns | 671137 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 750 ns | 667 ns | 1.12 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 750 ns | 667 ns | 1.12 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 750 ns | 0.89 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 750 ns | 708 ns | 1.06 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23007 ns | 22523 ns | 1.02 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2046355.5 ns | 2081039 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 209722 ns | 208153 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4958 ns | 4833 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5000 ns | 4917 ns | 1.02 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5125 ns | 5250 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4917 ns | 4875 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 221201.5 ns | 217286 ns | 1.02 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9546335 ns | 10433598 ns | 0.91 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 585401 ns | 579966 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8708 ns | 7709 ns | 1.13 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8833.5 ns | 9042 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9812.5 ns | 10083.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8625 ns | 9125 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 118248.5 ns | 114799.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3665770 ns | 3549639 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 71271 ns | 74946 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8959 ns | 9125 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9041.5 ns | 8459 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9333.5 ns | 8958 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8687.5 ns | 8542 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 566922 ns | 551169 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 21117090.5 ns | 20846570.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 343484 ns | 344893 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 126584 ns | 126604.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 96271 ns | 129541 ns | 0.74 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 96479.5 ns | 130458 ns | 0.74 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 183375 ns | 182896 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46672 ns | 46147 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/GPU/oneAPI | 72329738.5 ns | 73869604 ns | 0.98 |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 99821 ns | 104461 ns | 0.96 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 330333 ns | 341208 ns | 0.97 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 166292 ns | 327416 ns | 0.51 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 170250 ns | 345562.5 ns | 0.49 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 572041.5 ns | 569312.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 187343 ns | 183705 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/GPU/oneAPI | 93991840 ns | 92315110 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 487975 ns | 502435 ns | 0.97 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 398958 ns | 399333 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 215334 ns | 288167 ns | 0.75 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215041 ns | 288020.5 ns | 0.75 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 753500 ns | 755875 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43980 ns | 43522.5 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1471446 ns | 1347689.5 ns | 1.09 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 81451 ns | 80731 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1401520.5 ns | 1404291.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 862917 ns | 1136208 ns | 0.76 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 861417 ns | 1136375 ns | 0.76 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2361042 ns | 2442125 ns | 0.97 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 253211 ns | 242542 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10756472 ns | 9970984 ns | 1.08 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 349378.5 ns | 353108.5 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 651917 ns | 643667 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 658334 ns | 649416 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 662479 ns | 646791.5 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 579395.5 ns | 640271.5 ns | 0.90 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 189789 ns | 184288 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8425868.5 ns | 8427619.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 261218 ns | 303113 ns | 0.86 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2487416 ns | 2480000 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2468708 ns | 2441417 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2451333 ns | 2445375 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2415666 ns | 2435667 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 951768.5 ns | 927220.5 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 52278704 ns | 53936788.5 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1454255 ns | 1316013 ns | 1.11 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 33000 ns | 33875 ns | 0.97 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 36083.5 ns | 35271 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 32167 ns | 34937.5 ns | 0.92 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 1041.5 ns | 917 ns | 1.14 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16094 ns | 15816 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/GPU/oneAPI | 73434519 ns | 76295890 ns | 0.96 |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 77491 ns | 79581 ns | 0.97 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3187 ns | 3209 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3208 ns | 3291 ns | 0.97 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3417 ns | 3417 ns | 1 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3209 ns | 3042 ns | 1.05 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 136515 ns | 134276.5 ns | 1.02 |
| batchedmm(2, Bsize=32)/zygote/GPU/oneAPI | 97956919.5 ns | 97067229 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 349978 ns | 337123 ns | 1.04 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 437166.5 ns | 437667 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 433083 ns | 437833 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 434750 ns | 438458.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 449916 ns | 447416.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 42836 ns | 42161.5 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1439530.5 ns | 1435529 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 238823 ns | 241817.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4154959 ns | 4145167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4268667 ns | 4268333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4254625 ns | 4271604 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4048000 ns | 4025417 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 236422 ns | 230700.5 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 38227667 ns | 36716035 ns | 1.04 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1232498 ns | 1427924 ns | 0.86 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3959 ns | 3875 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3916 ns | 3958 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3875 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 34298 ns | 34754 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1231913.5 ns | 1243353 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 40891 ns | 40071 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15583 ns | 15458 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15666 ns | 16042 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15708 ns | 15875 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15459 ns | 15625 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 255323 ns | 252802 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 8849424 ns | 8940938 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 170142 ns | 171532 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 403708 ns | 404167 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 221167 ns | 295916 ns | 0.75 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 220959 ns | 296417 ns | 0.75 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 756709 ns | 760709 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113380 ns | 113187 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI | 1019116 ns | 1044690.5 ns | 0.98 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89671 ns | 89211 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1430083 ns | 1444875 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 886645.5 ns | 1158416 ns | 0.77 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 879208.5 ns | 1158333 ns | 0.76 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2383084 ns | 2464875 ns | 0.97 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 238474 ns | 231034 ns | 1.03 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI | 11515696 ns | 10580994 ns | 1.09 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 354939 ns | 352438 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 625 ns | 583 ns | 1.07 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 625 ns | 583 ns | 1.07 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 24737 ns | 24556 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1220602 ns | 1215077.5 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 210152 ns | 207412 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8042 ns | 7542 ns | 1.07 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7750 ns | 7916 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8020.5 ns | 7958.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8084 ns | 7708 ns | 1.05 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 206918.5 ns | 202724.5 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 26025510 ns | 26299646 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 691747 ns | 691927 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 829437 ns | 833708.5 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 466125 ns | 617667 ns | 0.75 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 467854 ns | 620250 ns | 0.75 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1548750 ns | 1558000 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 130261 ns | 134627 ns | 0.97 |
| batchedmm(128, Bsize=32)/forward/GPU/oneAPI | 75506190.5 ns | 75767504.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 166677 ns | 232042 ns | 0.72 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2692000 ns | 2690520.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1529979 ns | 2001666.5 ns | 0.76 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1534291.5 ns | 2002375 ns | 0.77 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4940020.5 ns | 4923459 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 232798.5 ns | 232967 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/oneAPI | 103506946 ns | 99203033 ns | 1.04 |
| batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 770132.5 ns | 768327.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 334 ns | 1.12 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32356 ns | 31737 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1255980.5 ns | 1097642 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 48991 ns | 46990 ns | 1.04 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6583 ns | 6250 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6625 ns | 6334 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6708 ns | 6667 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6625 ns | 6500 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 227984 ns | 216848.5 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21076541 ns | 22868710.5 ns | 0.92 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 356278.5 ns | 363084 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1758084 ns | 1756083 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1756792 ns | 1773708.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1737458 ns | 1731875 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1733750 ns | 1723709 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 188495 ns | 185580 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8152359 ns | 8097715 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 357369 ns | 375774 ns | 0.95 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4372937 ns | 4363834 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4370667 ns | 4360063 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4369375 ns | 4378875 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4362583.5 ns | 4369520.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 853700 ns | 829356 ns | 1.03 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 48123461 ns | 48033448.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1252878 ns | 1396403.5 ns | 0.90 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6792 ns | 7146 ns | 0.95 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7209 ns | 9833 ns | 0.73 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7333 ns | 7250 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 7312.5 ns | 6875 ns | 1.06 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 22968 ns | 21835 ns | 1.05 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI | 1178755.5 ns | 1202854 ns | 0.98 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 37681 ns | 40090.5 ns | 0.94 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 48354 ns | 68125 ns | 0.71 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 69083 ns | 66458.5 ns | 1.04 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 33542 ns | 51312.5 ns | 0.65 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 44979 ns | 32958.5 ns | 1.36 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 210612 ns | 205180 ns | 1.03 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10579631 ns | 10713432 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 235022 ns | 269342.5 ns | 0.87 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 21334 ns | 22083.5 ns | 0.97 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 24750 ns | 25042 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 22583.5 ns | 24666.5 ns | 0.92 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5417 ns | 5583 ns | 0.97 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18352 ns | 17692 ns | 1.04 |
| batchedmm(2, Bsize=512)/forward/GPU/oneAPI | 88615648 ns | 89463574 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 90001 ns | 84500.5 ns | 1.07 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 12187 ns | 12041 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 9250 ns | 10167 ns | 0.91 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 9625 ns | 10584 ns | 0.91 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 18375 ns | 17770.5 ns | 1.03 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 219960 ns | 217435 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/GPU/oneAPI | 148558454.5 ns | 145607100.5 ns | 1.02 |
| batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 383514 ns | 372684 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 407000 ns | 406209 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 223500 ns | 297375 ns | 0.75 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 223250 ns | 297291 ns | 0.75 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762333 ns | 762584 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47174.5 ns | 46433 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI | 1364423 ns | 1403980.5 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 90560 ns | 89281 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1429042 ns | 1428979 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 893625 ns | 1164271 ns | 0.77 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 893041 ns | 1168292 ns | 0.76 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2387667 ns | 2470833 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 278164 ns | 271835 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI | 14883883 ns | 11893591 ns | 1.25 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 378859 ns | 378099 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 435708 ns | 437000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 431625 ns | 440041 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 432333 ns | 438959 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 450291 ns | 449417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 54012 ns | 53469 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1020178 ns | 1006988 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 238112 ns | 234822 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4144125 ns | 4132333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4245667 ns | 4262646 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4258583 ns | 4266645.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4033625 ns | 4029729 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 257888 ns | 251074 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32096567.5 ns | 31753545 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1222232 ns | 1374018.5 ns | 0.89 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9459 ns | 9542 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 7250 ns | 8167 ns | 0.89 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7250 ns | 8167 ns | 0.89 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 13458 ns | 13417 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24527 ns | 23409 ns | 1.05 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2168792 ns | 2209102 ns | 0.98 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 211892 ns | 211472 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 49500 ns | 49709 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 49708 ns | 49667 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 49417 ns | 50250 ns | 0.98 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 49208.5 ns | 49792 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 339671 ns | 333916 ns | 1.02 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 12414783 ns | 10942581.5 ns | 1.13 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 654987 ns | 658106.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 125000 ns | 84687.5 ns | 1.48 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 89417 ns | 90459 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86583 ns | 85792 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 120666.5 ns | 84021 ns | 1.44 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 191941.5 ns | 191047 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5724606 ns | 5785931 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 200372 ns | 222432 ns | 0.90 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2022250 ns | 2027833 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2017666.5 ns | 2014979.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2024042 ns | 2016229.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2020812.5 ns | 2015812.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 516999 ns | 505179 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 28747125 ns | 28452120 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1090611 ns | 1086300 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
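
Each benchmark entry above gives two timings in nanoseconds followed by a ratio. The sketch below shows how that last value can be reproduced, assuming the first timing belongs to this commit, the second to the previous one, and the ratio is the first divided by the second, rounded to two decimals. The `Row` type, the helper names, and the 1.3 threshold are hypothetical and only serve the illustration; the three sample rows are copied from the table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    first_ns: float   # assumed: timing measured for this commit
    second_ns: float  # assumed: timing measured for the previous commit


def ratio(row: Row) -> float:
    """Return first/second rounded to two decimals, like the last value of each entry."""
    return round(row.first_ns / row.second_ns, 2)


def slower_than(rows, threshold=1.3):
    """Names of entries whose ratio exceeds the (hypothetical) threshold."""
    return [r.name for r in rows if ratio(r) > threshold]


# Three entries copied from the table above.
ROWS = [
    Row("batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s)", 166292, 327416),
    Row("batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 375, 250),
    Row("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)",
        125000, 84687.5),
]

for r in ROWS:
    print(f"{ratio(r):.2f}  {r.name}")
```

Under these assumptions, a ratio above 1 means the first timing is larger than the second. For sub-microsecond entries such as the 375 ns batchnorm case, a large ratio may reflect timer granularity rather than a real change.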