This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

feat: auto-training mode and strict checks #145

Merged: 1 commit, Aug 29, 2024

@avik-pal (Member) commented Aug 29, 2024

Initial part of #98. We start off with a warning, but in 1.0 we will transition to an error.
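
The warn-first, error-later rollout described above can be sketched as follows. This is an illustrative Python sketch, not LuxLib's implementation: the names `resolve_training_mode`, `inside_autodiff`, and `STRICT_TRAINING_CHECKS` are hypothetical, and the rule being checked (training requested outside an autodiff call) is an assumption, since the conversation does not state the exact condition.

```python
import warnings

# Hypothetical switch: False mimics the "warning for now" stage,
# True mimics the "error in 1.0" stage mentioned in the PR description.
STRICT_TRAINING_CHECKS = False


def resolve_training_mode(training, inside_autodiff, strict=STRICT_TRAINING_CHECKS):
    """Return the effective training flag.

    training: True, False, or None (None means "choose automatically").
    inside_autodiff: whether the call is being differentiated. How this is
        detected is out of scope here and assumed to be known by the caller.
    """
    if training is None:
        # Auto mode (assumed rule): train only when gradients are being computed.
        return inside_autodiff
    if training and not inside_autodiff:
        message = "training=True was requested outside of an autodiff call"
        if strict:
            # Strict stage: the mismatch is a hard error.
            raise ValueError(message)
        # Lenient stage: the mismatch is only reported.
        warnings.warn(message, stacklevel=2)
    return training
```

Keeping the decision behind a single flag means the later switch from warning to error is a one-line change and can be exercised in tests ahead of time.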

@github-actions bot (Contributor) left a comment

LuxLib Benchmarks
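
Each row of the results that follow has the shape `<benchmark name> <current> ns <previous> ns <ratio>`, where the ratio is the current time divided by the previous one, so values above 1 mean the current commit is slower. A minimal sketch for scanning such rows for slowdowns is shown here; the helper names and the 1.5x threshold are assumptions made for illustration and are not part of the benchmark tooling.

```python
import re

# Trailing "<current> ns <previous> ns <ratio>" fields of one table row.
ROW_TAIL = re.compile(r"\s([\d.]+) ns ([\d.]+) ns ([\d.]+)$")


def parse_row(line):
    """Split one benchmark row into (name, current_ns, previous_ns, ratio)."""
    match = ROW_TAIL.search(line)
    if match is None:
        raise ValueError(f"not a benchmark row: {line!r}")
    name = line[: match.start()]
    current, previous, ratio = (float(group) for group in match.groups())
    return name, current, previous, ratio


def regressions(lines, threshold=1.5):
    """Yield (name, recomputed ratio) for rows slower than `threshold` times."""
    for line in lines:
        name, current, previous, _ = parse_row(line)
        if current / previous > threshold:
            yield name, round(current / previous, 2)


# Two rows copied from the table: one large slowdown, one unchanged timing.
rows = [
    "batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 147000 ns 10208 ns 14.40",
    "dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1",
]
flagged = list(regressions(rows))
```

With these two rows, only the `batchnorm` zygote entry is flagged. Recomputing the ratio from the raw timings, instead of trusting the printed column, also sidesteps the fact that exact ties are printed as `1` and other values as two-decimal numbers.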

Benchmark suite Current: fb000d0 Previous: 56e40d8 Ratio (Current / Previous; above 1 means slower)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5874.5 ns 6083.5 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5791 ns 5729 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7334 ns 8208 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5771 ns 7417 ns 0.78
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 115557 ns 119536 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2721197 ns 2858698 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 751875 ns 774000 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 414784 ns 413554 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10250 ns 9750 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9708 ns 9541.5 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9791 ns 9833 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10042 ns 10000 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 535648 ns 548421 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 6522514 ns 6391891 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2497541 ns 13032833 ns 0.19
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 682006 ns 680216 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1542 ns 1333 ns 1.16
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1667 ns 3125 ns 0.53
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 2000 ns 2750 ns 0.73
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1458 ns 1646 ns 0.89
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21274 ns 21670 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1314672 ns 1360152 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 202792 ns 200500 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 31600 ns 30925.5 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3666 ns 4000.5 ns 0.92
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4000 ns 3542 ns 1.13
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4083.5 ns 4250 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3667 ns 3979 ns 0.92
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 142562 ns 146351 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 8818919 ns 9417207 ns 0.94
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1444500 ns 1465833.5 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 150661.5 ns 148801 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57417 ns 57959 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47062.5 ns 46875 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46958 ns 46666 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82666 ns 82958 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36368 ns 37604.5 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 573767 ns 581086 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1032834 ns 1034333.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 79390 ns 79736 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2308000 ns 2032625 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2219209 ns 2089666 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2265792 ns 2087125 ns 1.09
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2271083 ns 1994500 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 458763 ns 234123 ns 1.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 7651123 ns 7389099 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7033375 ns 5422500 ns 1.30
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1308453 ns 1219571 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 147833 ns 164187.5 ns 0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 173437.5 ns 153833 ns 1.13
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 151270.5 ns 150458 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 146208.5 ns 167666.5 ns 0.87
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165164 ns 166286 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7366575 ns 7718471.5 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1388500 ns 1555229.5 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 191773 ns 190051 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1118292 ns 1111646 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1109000 ns 1113250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1120333 ns 1116271 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1118750 ns 1115771 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 681370.5 ns 700098 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36156481 ns 33735820.5 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6091666 ns 6479708 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 922886 ns 1025985 ns 0.90
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5666.5 ns 5708.5 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4229 ns 4292 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6416.5 ns 5708 ns 1.12
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5312.5 ns 6395.5 ns 0.83
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 89522 ns 93278.5 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5362277 ns 5349506.5 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 454125 ns 453417 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 72051 ns 59270 ns 1.22
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9208 ns 8625 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8584 ns 8667 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9000 ns 9000 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9083 ns 8792 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 589330 ns 614663 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 37693204.5 ns 34665842.5 ns 1.09
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5442709 ns 5535062.5 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 391815 ns 384114 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17937.5 ns 19000 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18250 ns 18708.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21500 ns 20209 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18291.5 ns 18542 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 65529 ns 66269 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 2847915 ns 2900988 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1318875 ns 1296084 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 73415.5 ns 73291 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213209 ns 222000 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 224292 ns 211750 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216083.5 ns 223125 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 223875 ns 221021 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 344520 ns 354565.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 12592357.5 ns 13542856.5 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5804917 ns 6042354 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 477606 ns 480655 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 667 ns 0.94
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 625 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 1042 ns 916 ns 1.14
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 791 ns 0.79
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20212 ns 20668 ns 0.98
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1171756.5 ns 1159361.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 286125 ns 278000 ns 1.03
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 34560 ns 34450 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1417 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1458 ns 1520.5 ns 0.96
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1667 ns 1583 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1375 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 122479 ns 125954.5 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8995739 ns 8769872 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1460625 ns 1445458 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 130722 ns 128201.5 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7417 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 6125 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6166 ns 6167 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10333 ns 10292 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23423.5 ns 24236 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1238741 ns 1337008 ns 0.93
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 621896.5 ns 513333 ns 1.21
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49041 ns 48301 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 371104 ns 220959 ns 1.68
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 413916 ns 269208.5 ns 1.54
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 422208 ns 263750 ns 1.60
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 363833.5 ns 225583 ns 1.61
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 407257 ns 192759 ns 2.11
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31598972 ns 29975947 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9262166.5 ns 8958750 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 730489 ns 608296 ns 1.20
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4084 ns 4083 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4084 ns 4083 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4125 ns 4083 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4083 ns 4042 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23414 ns 23710 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 2073024 ns 1967682 ns 1.05
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 222041.5 ns 220708 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 52231 ns 48820 ns 1.07
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16834 ns 16625 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 17333 ns 16667 ns 1.04
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17125 ns 17125 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16917 ns 16750 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 188995.5 ns 195325 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 11366884 ns 9992661.5 ns 1.14
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 932458.5 ns 956125 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 178432 ns 177522 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 510041 ns 509750 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 405334 ns 404666 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 404000 ns 405500 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 865041 ns 864500 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113309 ns 113934 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 393832 ns 399968.5 ns 0.98
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 385354 ns 452750 ns 0.85
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 249783 ns 248142 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2327625 ns 2323875 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2029083 ns 2027687 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2034187 ns 2035333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3282959 ns 3278166 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 236546 ns 240558 ns 0.98
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11904169 ns 9061829 ns 1.31
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1917833 ns 1864375 ns 1.03
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 762089 ns 762112 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6500 ns 6791.5 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6500 ns 6354.5 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8229.5 ns 8021 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6333 ns 7520.5 ns 0.84
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 88687.5 ns 92014.5 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5523378 ns 5475275 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 719916 ns 726667 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 62181 ns 60220 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12375 ns 11083.5 ns 1.12
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11417 ns 11729 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12500 ns 12750 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12458.5 ns 11291.5 ns 1.10
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 611796 ns 656827.5 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 37468397.5 ns 40222366 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5362250 ns 5480792 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 418230 ns 413864 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 500 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 22927 ns 23122 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2183765 ns 2175700 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 222437.5 ns 322625 ns 0.69
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 53110 ns 53980 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2083 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2083 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2208 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 210803 ns 223205.5 ns 0.94
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 11049931 ns 11597723.5 ns 0.95
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 1947417 ns 1948500 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 176312 ns 183602 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9042 ns 8812 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8083.5 ns 8583.5 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11895.5 ns 10667 ns 1.12
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8875 ns 8604.5 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 91893.5 ns 99076.5 ns 0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3103648 ns 2921578.5 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 777541.5 ns 808625 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 78651 ns 78481 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16958 ns 17750.5 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18145.5 ns 18334 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 19000 ns 18792 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17750 ns 18104.5 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 534997 ns 609459 ns 0.88
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 15733339.5 ns 16546891.5 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 4951333.5 ns 5201875 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 394905 ns 392264 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 541 ns 542 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 541 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 541 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 34529 ns 35893 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1235951 ns 1234257.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 287292 ns 308542 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 48301 ns 45990.5 ns 1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 147000 ns 10208 ns 14.40
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 148417 ns 9042 ns 16.41
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 161208.5 ns 11208 ns 14.38
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 148333 ns 9666 ns 15.35
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 451818 ns 268063.5 ns 1.69
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18323518 ns 18168061 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4753042 ns 4946250 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 537806 ns 374863 ns 1.43
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396958 ns 397125 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288125 ns 287584 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287979.5 ns 288125 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 755542 ns 756000 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111265.5 ns 112334 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 328662 ns 331349 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 367000 ns 453166 ns 0.81
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 78721 ns 78321 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1454062.5 ns 1442312.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1136416 ns 1128583 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1132354.5 ns 1136375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2439916 ns 2441021 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 204364.5 ns 207111 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 9824415 ns 10702686 ns 0.92
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1570541.5 ns 1560271 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 325414 ns 324973 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7395.5 ns 7208.5 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7666.5 ns 7166.5 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9792 ns 8250 ns 1.19
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7145.5 ns 7750 ns 0.92
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 129489 ns 148537.5 ns 0.87
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5955624 ns 5887503 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 439000 ns 464209 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 60971 ns 59820 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17125 ns 16500 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15771 ns 15041.5 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16520.5 ns 15708 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15375 ns 15041 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 855248.5 ns 975911 ns 0.88
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 44381652.5 ns 46239281 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5527750 ns 5635271 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 435840 ns 439474 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 30500 ns 25083 ns 1.22
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 27834 ns 26354.5 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 31292 ns 29833 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25771 ns 25291 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 191688 ns 200872.5 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7437045.5 ns 7942712.5 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 586292 ns 976813 ns 0.60
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 118032 ns 117671 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 154208 ns 103917 ns 1.48
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 150125 ns 154250 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 153709 ns 143979 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 104021 ns 112208 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1029587 ns 1080618 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43318293.5 ns 45956117 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5712583.5 ns 5734166.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 600207 ns 598555 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74500 ns 77125 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 85875 ns 74229 ns 1.16
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 84000 ns 79687.5 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 75208 ns 75770.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 203652.5 ns 207270 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7714505.5 ns 7792417.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 526209 ns 522646 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 125851 ns 123966 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221270.5 ns 287541.5 ns 0.77
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 295000 ns 301792 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 302375 ns 295041 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221417 ns 218208 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1101887 ns 1107042 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42953193 ns 43388234 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6252458.5 ns 6243958 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 702428 ns 701281.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16375 ns 17834 ns 0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 18000 ns 16625 ns 1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 18374.5 ns 18104.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 17083 ns 16729.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 148672.5 ns 150748.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5794282 ns 5427549 ns 1.07
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 443708 ns 452333 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 240983 ns 237672 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27270.5 ns 27500.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26687.5 ns 27583 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27333 ns 28875 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27167 ns 25146 ns 1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 956079 ns 981795 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 42186646 ns 43049751.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5738708 ns 5608208 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 716559 ns 715207 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11291 ns 12313 ns 0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 10812.5 ns 10020.5 ns 1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13646 ns 12417 ns 1.10
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11833 ns 11208.5 ns 1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 121242 ns 122999 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3475987.5 ns 3856591 ns 0.90
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 857542 ns 783354.5 ns 1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 243332.5 ns 244302 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 22416 ns 22250 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 22458 ns 21396 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21833 ns 23000 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 22333 ns 21687.5 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 685433 ns 704827 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 21324187 ns 19822430 ns 1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5433000 ns 5200895.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 684518 ns 687056 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 64291 ns 63625.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 62291 ns 62583 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 68000 ns 66646 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 63458 ns 66937.5 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 106200.5 ns 105671 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3278706 ns 3419840 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1343125 ns 1336188 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 240213 ns 238667 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 449625 ns 475666 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 450041.5 ns 448750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 452875 ns 446208 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 473708 ns 478625 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 506804 ns 516873.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20711437 ns 20193366 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6340770.5 ns 6184938 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 734073.5 ns 717327 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6750 ns 7271 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8333.5 ns 7083 ns 1.18
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9583 ns 8250 ns 1.16
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8084 ns 7521 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 144433.5 ns 146807 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5559405 ns 5447467.5 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 442333 ns 462959 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 59460.5 ns 61520 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15271 ns 15854.5 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15042 ns 13895.5 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16687.5 ns 15458 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17292 ns 14041 ns 1.23
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 925112 ns 952735 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 38208450 ns 39171319.5 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5445417 ns 5387667 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 403345 ns 406764 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6148375 ns 6150187.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6378583 ns 6375084 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6377979.5 ns 6377896 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11916021 ns 11916958 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 347485 ns 345906.5 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 294433.5 ns 293393 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19132250 ns 19109896 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19973479 ns 19969688 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 20014417 ns 19911667 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36557250 ns 36665438 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1025341.5 ns 1011944.5 ns 1.01
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1151734 ns 1168811 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 958 ns 958 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1000 ns 917 ns 1.09
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 959 ns 959 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 959 ns 917 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23037 ns 23377 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2116275 ns 2088445 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 288334 ns 218062.5 ns 1.32
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 215612 ns 214272.5 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3625 ns 3667 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3750 ns 3667 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3875 ns 3791 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3667 ns 3667 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 275557 ns 284240 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11382415 ns 10831129 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2104917 ns 2013458 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 651302.5 ns 642396 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8167 ns 7834 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7792 ns 8250 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9542 ns 9208.5 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8563 ns 9125 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 119002 ns 120248 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3379292 ns 3419283 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 784000 ns 777875.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 72261 ns 68341 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11041 ns 11875 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11209 ns 12167 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12958 ns 12625 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10812.5 ns 12709 ns 0.85
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 626600.5 ns 645406 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22838448 ns 20750949 ns 1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 4545250 ns 4833541 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 371774 ns 362853 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 250 ns 291 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22136 ns 22453 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2088688 ns 2036768.5 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 221666.5 ns 218625 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 53240 ns 51501 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2833 ns 2833 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2792 ns 2833 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3166 ns 3209 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2792 ns 2834 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 198602.5 ns 203537.5 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9486294 ns 9796958 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1514917 ns 1523062.5 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 161262 ns 161481 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12084 ns 11875 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11792 ns 10875 ns 1.08
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 14292 ns 13208 ns 1.08
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12000 ns 11875 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 119093 ns 120852 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3487542 ns 3578611 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 826000 ns 824000 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 243253 ns 240102 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20416.5 ns 21792 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20479.5 ns 21834 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 20729 ns 22271 ns 0.93
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 19479.5 ns 20584 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 579311 ns 600222 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20554218 ns 20377991.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4772875 ns 4668250 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 652368 ns 663226 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4417 ns 4416 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24066.5 ns 24569 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2256365 ns 2264680 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 223000 ns 222854.5 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 52711 ns 52551 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16667 ns 16542 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16417 ns 16458 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16500 ns 16750 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16625 ns 16584 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 325746.5 ns 329740.5 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12327891 ns 12210269.5 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1081354.5 ns 1074708 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 211922.5 ns 212612 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 1958 ns 2083 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2083 ns 2084 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2041 ns 2167 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 1958 ns 1958 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 35040 ns 36693 ns 0.95
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1241357 ns 1172885.5 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 286125 ns 289959 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 207902 ns 206982 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 148896 ns 17541.5 ns 8.49
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 144771 ns 19584 ns 7.39
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 149292 ns 18875 ns 7.91
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 145000 ns 19896 ns 7.29
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 484165.5 ns 291009.5 ns 1.66
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 10024099 ns 19691709 ns 0.51
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4663146 ns 4873604 ns 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 819939 ns 691816 ns 1.19
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 58833.5 ns 59750 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 66541.5 ns 65125 ns 1.02
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 66854 ns 66229 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51250 ns 51250 ns 1
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66577 ns 66341 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 101381 ns 96856 ns 1.05
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 134562.5 ns 149084 ns 0.90
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 166750 ns 109437.5 ns 1.52
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 164792 ns 142625 ns 1.16
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 233729 ns 252625 ns 0.93
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 213970 ns 218082 ns 0.98
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 586847 ns 579290.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 83583 ns 128229.5 ns 0.65
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 86166 ns 124458 ns 0.69
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86791 ns 121520.5 ns 0.71
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84666 ns 84354 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193431 ns 193150.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5266325 ns 5581378 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1961875 ns 1913292 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 170922 ns 170532 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1910750 ns 1825541.5 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1913479.5 ns 1917500 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1930458.5 ns 1726708 ns 1.12
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1911667 ns 1896375 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 526485 ns 531416 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 25811415 ns 25804121 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8890041 ns 9091041.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1086682 ns 1081700 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 291 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21239 ns 21564 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2029526 ns 2150170.5 ns 0.94
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 329625 ns 322646 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 45115.5 ns 44940 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1791 ns 1.05
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 249589.5 ns 253017.5 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 10044931 ns 9954512 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1052750 ns 1489959 ns 0.71
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 183272 ns 184662 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10083 ns 8208 ns 1.23
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8854.5 ns 8354 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11041.5 ns 11062.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9000 ns 11375 ns 0.79
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 116531 ns 117709 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3670049 ns 3377279 ns 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 818083.5 ns 841167 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 239873 ns 237972 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8541 ns 10750 ns 0.79
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8479.5 ns 9625 ns 0.88
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8854.5 ns 10333 ns 0.86
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8084 ns 9437.5 ns 0.86
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 513528 ns 528567.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20228071.5 ns 20481875 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 3987521 ns 4066875 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 650677.5 ns 650956 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57958 ns 58458 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46500 ns 46834 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46417 ns 46541 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83000 ns 82770.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38759 ns 40116 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1415113 ns 1353318 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1142604 ns 1107646 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 76641 ns 75891 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2016958 ns 1830958 ns 1.10
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2108125 ns 1987709 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2126125 ns 1806000 ns 1.18
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2031604.5 ns 1902167 ns 1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 403273 ns 224182 ns 1.80
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 12592175 ns 33930875 ns 0.37
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11201375 ns 11292291.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1027411 ns 1025890 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 418375 ns 418083 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 418875 ns 418854.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 421396 ns 419624.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 431167 ns 418083 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 207671.5 ns 210311 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7803352.5 ns 7920822 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 527542 ns 525521 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 287424 ns 284163 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 757041.5 ns 669416.5 ns 1.13
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 668625 ns 671291.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 677167 ns 684750 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 669208 ns 684021 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1038754 ns 1058312 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45709188 ns 44385100.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6421375 ns 6341125 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 924830 ns 918153 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3445500 ns 3455395.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3467104 ns 3437542 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3440104 ns 3456500 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3397583 ns 3441812 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169978 ns 173936 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8429258 ns 8236547 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1333667 ns 1383541.5 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 431205 ns 408024 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6187042 ns 6212292 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6193833.5 ns 6192374.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6188917 ns 6230104.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6203437.5 ns 6210542 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 985127.5 ns 1001699 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50910191.5 ns 52343757.5 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7089083 ns 7314167 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1562022.5 ns 1560500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 471645.5 ns 471667 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 341416 ns 341500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 341958 ns 341250 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 903541 ns 901083.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46127 ns 46237 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 834768.5 ns 841979 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 398208 ns 403541.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 251432 ns 251513 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2328124.5 ns 2304875 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2034625 ns 2036291 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2036333 ns 2035208 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3318312.5 ns 3278208.5 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 265130.5 ns 256609 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 16240147 ns 13028144 ns 1.25
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2194667 ns 2192084 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 792439 ns 788718 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57583 ns 57833 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46500 ns 46584 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46291 ns 46083 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82542 ns 83709 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 27969 ns 28664 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1424400 ns 1387212 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1133750 ns 1120333 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 76011 ns 77001 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2193084 ns 1999146 ns 1.10
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2214979 ns 2075834 ns 1.07
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2249042 ns 1881917 ns 1.20
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2140854 ns 1993250 ns 1.07
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 445719.5 ns 229523 ns 1.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 38789932 ns 36882455.5 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11702791.5 ns 11806542 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1094522 ns 1046160 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57750 ns 57917 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47083 ns 47250 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46500 ns 46750 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82833.5 ns 83250 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47723 ns 49927 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 785704 ns 787820 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1095166 ns 1080583 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77990.5 ns 73870 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2086062.5 ns 1891083.5 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2116208 ns 1970208 ns 1.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2144666.5 ns 1955187.5 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2046292 ns 1904291 ns 1.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 487919.5 ns 234662 ns 2.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28265363.5 ns 18211943.5 ns 1.55
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10193083.5 ns 10103542 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1010551 ns 933389 ns 1.08
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 291 ns 333 ns 0.87
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 33775 ns 34917 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1145972 ns 1226168.5 ns 0.93
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 269646 ns 272625 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 50361 ns 47950 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 143208 ns 7479.5 ns 19.15
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 141625 ns 6792 ns 20.85
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 144500 ns 8375 ns 17.25
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 140292 ns 7291 ns 19.24
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 430404 ns 203469.5 ns 2.12
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 21186553 ns 20240791 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4566833 ns 4583500 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 517175 ns 374783 ns 1.38
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32093 ns 31986 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1228680 ns 1276838 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 251125 ns 250958.5 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 40461 ns 39251 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2667 ns 3000 ns 0.89
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2834 ns 2666 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2875 ns 3000 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2709 ns 3250 ns 0.83
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 184759.5 ns 193112 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 7793763 ns 7648888.5 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 963604.5 ns 1228042 ns 0.78
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 159696.5 ns 155301 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 443666 ns 423250 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 426750 ns 422000 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 427375 ns 426584 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 425854.5 ns 433042 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 137307.5 ns 138742 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 4108067 ns 6023505.5 ns 0.68
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2074313 ns 2150209 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 326613 ns 350454 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3803750 ns 3765146 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3802666 ns 3779584 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3822250 ns 3801667 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3801875 ns 3781770.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 703504 ns 710296 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33026409 ns 31895528 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10878208 ns 10614458 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1505796 ns 1323602 ns 1.14
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49864187 ns 49864000 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35504062.5 ns 35497062 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35538333.5 ns 35537125 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 96916667 ns 96997520.5 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1592214 ns 1604687 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 998606.5 ns 1017349 ns 0.98
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154562395.5 ns 154531062.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112345958 ns 112258062 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112532584 ns 112366667 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 297051812.5 ns 299279978.5 ns 0.99
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6520428 ns 6477003 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5676402.5 ns 5749519.5 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 18229 ns 19666.5 ns 0.93
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 17083.5 ns 18542 ns 0.92
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 16042 ns 17562.5 ns 0.91
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15625 ns 15437.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 21334 ns 21582 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1139016 ns 1137074 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 217520.5 ns 219625 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 26211 ns 27981 ns 0.94
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 10834 ns 10854.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 9042 ns 8916.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9291 ns 9292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17521 ns 17417 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 257401 ns 261948.5 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 10553951.5 ns 9733148 ns 1.08
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1502875 ns 1502000 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 154702 ns 152941 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8937.5 ns 8021 ns 1.11
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 9459 ns 8458 ns 1.12
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10208 ns 10375 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8520.5 ns 9583 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 122970.5 ns 125031 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3704089 ns 3572138 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 708166 ns 766396 ns 0.92
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 242933 ns 239027.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9270.5 ns 10270.5 ns 0.90
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9791 ns 9125 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9792 ns 9833 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9229.5 ns 9562 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 608318 ns 626181 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25004839 ns 23291818.5 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 4935416.5 ns 5110500 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 655337 ns 669926 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8812.5 ns 9021 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9791.5 ns 9292 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12083 ns 10792 ns 1.12
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8771 ns 9624.5 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 118050 ns 119634.5 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3451608 ns 3516934 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 875291.5 ns 854750 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 72861 ns 69771 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13250 ns 16583 ns 0.80
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13375 ns 12583 ns 1.06
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13916.5 ns 14020.5 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12958 ns 14104 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 580235.5 ns 597754 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19330798 ns 19866015 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4607833.5 ns 4399958 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 354624 ns 354973.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 459 ns 459 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 34114 ns 35591 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1260842 ns 1301172 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 274542 ns 273021 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 209992 ns 208092 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 138958 ns 9667 ns 14.37
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 136542 ns 7917 ns 17.25
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 140000 ns 8667 ns 16.15
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 136083 ns 10542 ns 12.91
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 438919.5 ns 228879.5 ns 1.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22996988 ns 22093342 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4752042 ns 4715584 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 792008 ns 665037 ns 1.19
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 16250 ns 16792 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 15333 ns 18042 ns 0.85
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 14625 ns 15104 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 11792 ns 10520.5 ns 1.12
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 21202 ns 21410 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1151785 ns 1212361 ns 0.95
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 211187.5 ns 204104 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 188407 ns 189022 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 31708 ns 31875 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 31958 ns 31709 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32375 ns 32312.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32166 ns 32000 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 271317 ns 276685 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 10945554 ns 10782012.5 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1602812.5 ns 1597917 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 607056 ns 603936 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 479875 ns 444417 ns 1.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 483521 ns 440729.5 ns 1.10
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 446874.5 ns 483875 ns 0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 484500 ns 487833 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195239 ns 194859 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6117630 ns 6150501 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1972500 ns 1973354.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 353204 ns 352013 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3824750 ns 3829083 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3813709 ns 3817542 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3827041 ns 3807333.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3827375 ns 3833750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 535235 ns 543447.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 29684309 ns 29268457 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9492709 ns 9074479 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1391296 ns 1381293 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 783309416 ns 782808250 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 542819541 ns 542955458 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 544218417 ns 543245416 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1569457250 ns 1526913187.5 ns 1.03
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22543017 ns 22538913 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14103431 ns 14166095 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3007940334 ns 2518672041 ns 1.19
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1800866291 ns 2247031041 ns 0.80
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1792982875 ns 2268043292 ns 0.79
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 5315494791 ns 4817775208 ns 1.10
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 379420999 ns 370296484 ns 1.02
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 89218848 ns 89108951 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76770.5 ns 78291.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 80583 ns 76292 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79646 ns 78708.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76625 ns 75666.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 205479.5 ns 209649 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8543893 ns 7907666.5 ns 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 527542 ns 527271 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 109941 ns 110221 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 289687.5 ns 267125 ns 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 277688 ns 192750 ns 1.44
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 193750 ns 228291 ns 0.85
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 264166.5 ns 274042 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1027978 ns 1049164 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44758239 ns 42798073 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6106375 ns 5942583 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 645427 ns 646896 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199655375 ns 199999187.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 138887125 ns 139287958 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139073250 ns 139251125 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 394615000 ns 388390459 ns 1.02
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5834346 ns 5842600.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3411391.5 ns 3422748 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 619240375 ns 618321291.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 440309458 ns 440516458 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 440130812.5 ns 441449562.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1195765667 ns 1184281125 ns 1.01
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26511719.5 ns 26535363 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 22284299 ns 22224253 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7292 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6209 ns 6041 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6166 ns 6042 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 9959 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27372 ns 28005 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1220411 ns 1278165.5 ns 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 513792 ns 585916.5 ns 0.88
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48370 ns 47711 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 380229.5 ns 214417 ns 1.77
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 359459 ns 220791 ns 1.63
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 363875 ns 221750 ns 1.64
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 335125 ns 208354 ns 1.61
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 418348.5 ns 227388 ns 1.84
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31583019 ns 32450442 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9289000 ns 9078291.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 668577 ns 531855 ns 1.26
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10187.5 ns 9708.5 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8667 ns 7875 ns 1.10
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10459 ns 10334 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8166.5 ns 8896 ns 0.92
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 117261.5 ns 116991 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3477087 ns 3379932 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 755208 ns 844125 ns 0.89
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 68860 ns 78200 ns 0.88
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7541.5 ns 10500 ns 0.72
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7667 ns 7209 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8292 ns 7958 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7521 ns 8125 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 510936.5 ns 524925 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19179630 ns 19844305 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4013292 ns 4066333.5 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 321554 ns 322173 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 459 ns 584 ns 0.79
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 541 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 625 ns 0.87
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 25702 ns 26471.5 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1237540.5 ns 1215499 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 303395.5 ns 366750 ns 0.83
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 49140 ns 48170.5 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 148125 ns 12709 ns 11.66
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 145187.5 ns 9291 ns 15.63
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 150208.5 ns 10083.5 ns 14.90
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 145416.5 ns 9750 ns 14.91
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 461372 ns 258485 ns 1.78
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 23030274 ns 22618575 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5062875 ns 5040750 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 565486 ns 393458.5 ns 1.44
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 115125 ns 108458 ns 1.06
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 99688 ns 98875 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 101270.5 ns 100521 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146604 ns 146417 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 24185 ns 24425.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1199345 ns 1223728 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 261959 ns 258208 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 191147 ns 191276.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 498209 ns 478583 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 517875 ns 480437 ns 1.08
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 478937.5 ns 482104 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 477958 ns 478875 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 229118 ns 234461 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11518633.5 ns 11870954.5 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2113625 ns 2153250 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 624696.5 ns 620366 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 4833 ns 5416 ns 0.89
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 6979 ns 5021 ns 1.39
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7416.5 ns 7000 ns 1.06
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 4646 ns 4375 ns 1.06
batchedmm(16, Bsize=32)/forward/GPU/CUDA 15972 ns 16254 ns 0.98
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 78941 ns 79120 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 11916 ns 13209 ns 0.90
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10833.5 ns 10333 ns 1.05
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11125 ns 11187.5 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16562.5 ns 16875 ns 0.98
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 211125 ns 214352 ns 0.98
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 370104 ns 369103.5 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 40292 ns 39084 ns 1.03
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 50709 ns 51604 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 53437.5 ns 52875 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13395.5 ns 13500 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 20045 ns 21418 ns 0.94
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 86141 ns 78891 ns 1.09
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 41708 ns 37875 ns 1.10
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 32375 ns 31625 ns 1.02
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 30895.5 ns 31125 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 56916.5 ns 57625 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 188929 ns 194392 ns 0.97
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 417859.5 ns 418224 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1979.5 ns 1854.5 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1833.5 ns 1709 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2375 ns 2209 ns 1.08
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1791.5 ns 1958 ns 0.91
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 20678 ns 21178 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1163936 ns 1109464 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 291541 ns 296916 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 28881 ns 28730 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2104.5 ns 2208 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns 2084 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2334 ns 2270.5 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2167 ns 2083 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 201809 ns 204626 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 9295907.5 ns 9185930.5 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1432187.5 ns 1373917 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 137651 ns 136741 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5229.5 ns 5187.5 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5687.5 ns 5021 ns 1.13
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6375 ns 6708 ns 0.95
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4375.5 ns 5292 ns 0.83
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 142965.5 ns 146489.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5606105 ns 5824136 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 449020.5 ns 461083 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 62570 ns 63631 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8271 ns 9208.5 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8417 ns 8000 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8625 ns 8667 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8604.5 ns 8708 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 858818 ns 883915 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 40092469.5 ns 38448137 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5355812.5 ns 5404750 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 384874 ns 389893 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56834 ns 56750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57708 ns 57625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57666 ns 57750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58125 ns 58292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37255 ns 37898 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1223472.5 ns 1225239 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 533354.5 ns 482042 ns 1.11
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 208377.5 ns 206812 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 605750 ns 461021.5 ns 1.31
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 623562.5 ns 464812.5 ns 1.34
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 628313 ns 465375 ns 1.35
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582834 ns 443771 ns 1.31
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 532888 ns 262541 ns 2.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 28144268 ns 26994597 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8798458.5 ns 8201187.5 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 928810 ns 814918 ns 1.14
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3332667 ns 3312833 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2330333 ns 2337166.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2339292 ns 2336896 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6320375 ns 6300021 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 205868 ns 204708 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 204322.5 ns 210502 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11449916 ns 11472333 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8300833.5 ns 8296521 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8348000 ns 8328708 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21108459 ns 21128979.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 742474 ns 742043 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1071261 ns 1071150 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7041 ns 6750 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6125 ns 4709 ns 1.30
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6750 ns 6667 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5791 ns 7083 ns 0.82
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 136817.5 ns 140008 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5779963 ns 5632229.5 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 723333 ns 742458 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 58111 ns 58641 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7166 ns 7209 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7334 ns 6959 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7292 ns 7541 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7042 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 740847.5 ns 764718 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 37404171.5 ns 36130678 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5084958 ns 5226958.5 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 381784 ns 383774 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 124250 ns 138354.5 ns 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 124166.5 ns 98145.5 ns 1.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 101708.5 ns 101167 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 113520.5 ns 106375 ns 1.07
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 150861 ns 151797 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6135447 ns 5992457 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2029958 ns 2019062.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 210087 ns 168952 ns 1.24
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1997854.5 ns 1834167 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1999916 ns 2017208 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2027166 ns 2009167 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2018708 ns 2029979.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 701304.5 ns 712230.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32626383 ns 31045366 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10780312.5 ns 10914875 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1123801.5 ns 1252337 ns 0.90
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 33041 ns 34437.5 ns 0.96
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 35959 ns 37312.5 ns 0.96
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 35604.5 ns 35812 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 708 ns 667 ns 1.06
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15322 ns 15573 ns 0.98
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 81061 ns 72020 ns 1.13
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2542 ns 2604.5 ns 0.98
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2875 ns 2792 ns 1.03
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2937.5 ns 2959 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2125 ns 2291 ns 0.93
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 136167 ns 141728 ns 0.96
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 365774 ns 348083 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7292 ns 7000 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6083 ns 5916 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 6000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10333 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 35823 ns 36891 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1228555 ns 1194445.5 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 389833 ns 487520.5 ns 0.80
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 51200 ns 49011 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 377104 ns 241166.5 ns 1.56
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 385167 ns 221000 ns 1.74
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 381167 ns 221542 ns 1.72
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 351916 ns 206542 ns 1.70
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 512269.5 ns 240362 ns 2.13
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27785411 ns 25649976.5 ns 1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8478687.5 ns 7897958.5 ns 1.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 666837 ns 521174 ns 1.28
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21501 ns 21897 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2120479 ns 2185292 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 241959 ns 242250 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 45910 ns 47251 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14959 ns 14917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15000 ns 14875 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14875 ns 15042 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14917 ns 14875 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 305862 ns 311422 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11568964.5 ns 12486721 ns 0.93
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 990500 ns 997375 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 194662 ns 204526.5 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 151292 ns 109666.5 ns 1.38
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 105542 ns 103749.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 105666 ns 105500 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 108334 ns 120375 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 135171.5 ns 152552 ns 0.89
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5948811 ns 5849524.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2056750 ns 2042500 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 187802 ns 185802 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1926208 ns 1782959 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1919708 ns 1919667 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1926500 ns 1893875 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1919333 ns 1914750 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 684579 ns 695036 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30815788.5 ns 30814201.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10672792 ns 10692646 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1239603 ns 1072840 ns 1.16
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19584 ns 19917 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18583 ns 17584 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21292 ns 21125 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17833.5 ns 19291 ns 0.92
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107607.5 ns 109095.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3724684 ns 3459371.5 ns 1.08
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1354625 ns 1363791.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 75895.5 ns 81501 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 218833 ns 216083 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216895.5 ns 249854 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217146 ns 216667 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215166.5 ns 215958.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 512344 ns 521547.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 21928657 ns 20634945 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6352541.5 ns 6272667 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 494060.5 ns 476034 ns 1.04
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 25000 ns 24875 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 28708 ns 30916.5 ns 0.93
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 28000 ns 30375 ns 0.92
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1187.5 ns 1250 ns 0.95
batchedmm(16, Bsize=4)/forward/GPU/CUDA 15742 ns 16240 ns 0.97
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 83041 ns 82701 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4791 ns 4375.5 ns 1.09
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5167 ns 4500 ns 1.15
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5208.5 ns 5271 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4750 ns 4750 ns 1
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 204557 ns 208732 ns 0.98
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 380439 ns 382674 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 308208 ns 306583 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 315750 ns 306500 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 308333 ns 309375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 305833.5 ns 307083 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 227475.5 ns 229206.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7965121 ns 7783651 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 593750 ns 1169270.5 ns 0.51
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 278443 ns 276543 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 559583 ns 537021 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 539500 ns 531791.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 533625 ns 547104.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 529812.5 ns 535708 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1062119 ns 1083213 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43579112.5 ns 45992002.5 ns 0.95
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6188375 ns 6107583.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 881479 ns 867099 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22708 ns 21292 ns 1.07
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20083 ns 20083 ns 1
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21417 ns 21458 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19542 ns 20416 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 112201 ns 113930 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3652323 ns 3636526.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1495604 ns 1444708.5 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79055.5 ns 77785.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212750 ns 215625 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213167 ns 215500 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215021 ns 213812 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215542 ns 216500 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 752835.5 ns 748686 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 24398027 ns 25547600 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7403000 ns 7444708 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 550166 ns 542025 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7458 ns 6792 ns 1.10
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6583 ns 6875 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8271 ns 8250 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 7104 ns 6750 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 139804.5 ns 141039.5 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5773128 ns 5493009 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 730937 ns 712166.5 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 70340 ns 70750 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10042 ns 10291 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10167 ns 9708 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10542 ns 10625 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10499.5 ns 10333 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 813031 ns 833353 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 37615031 ns 38773821 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5138875 ns 5198104 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 384474 ns 387004 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6708 ns 6292 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5229 ns 5125 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6708.5 ns 6958 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7125 ns 6875 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 141719 ns 144971.5 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5586545 ns 5893830.5 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 717396 ns 727542 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 62111 ns 60830 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7542 ns 7479 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7292 ns 7333 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7542 ns 7875 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7520.5 ns 7292 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 771432 ns 792524 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 41795223 ns 40455917 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5408708 ns 5364583.5 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 397734 ns 399744 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14505854.5 ns 14468708 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10125354.5 ns 10147583 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10158937 ns 10085542 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27749437.5 ns 27811041 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 540274 ns 542278 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 399624 ns 384534 ns 1.04
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46276833 ns 46218541.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33361542 ns 33417104.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33503958 ns 33420958 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85348583 ns 85450792 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2829546.5 ns 2643632 ns 1.07
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3291354.5 ns 3285991 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66667 ns 67625 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 65750 ns 67500 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 68833 ns 69791 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 66541 ns 68750 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 105401.5 ns 119783 ns 0.88
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3426262.5 ns 3660768 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1470791.5 ns 1428437.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 234912 ns 230052 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 446541 ns 439833 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 445937.5 ns 443166.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 442354 ns 444771 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 446375 ns 454646 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 728201 ns 731518 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26874397 ns 27861313 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7885708 ns 8299458.5 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 812439 ns 775003 ns 1.05
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 541 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 666 ns 0.94
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31896 ns 33044 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1266980.5 ns 1125170 ns 1.13
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 284750 ns 329250 ns 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 51781 ns 48860 ns 1.06
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 140375 ns 9104 ns 15.42
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 137583 ns 8625 ns 15.95
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 142625 ns 10250 ns 13.91
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 138042 ns 9896.5 ns 13.95
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 486965 ns 284435 ns 1.71
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 14162730 ns 21375944 ns 0.66
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4861750 ns 5566187.5 ns 0.87
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 517885 ns 386843 ns 1.34
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9833 ns 9792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9792 ns 9792 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9833 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9792 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 22840 ns 23403 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2189392 ns 2092475 ns 1.05
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 219750 ns 221416 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 217022 ns 215732 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 46167 ns 46167 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45916 ns 45875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46542 ns 46458 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 46042 ns 46125 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 285352 ns 290850 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11520821.5 ns 11312959 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 946416.5 ns 1043959 ns 0.91
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 624726.5 ns 616456 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56250 ns 56333 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57125 ns 57042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57125 ns 57167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57709 ns 57958 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28529 ns 29487 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1225122 ns 1284760 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 577958 ns 609396 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 206332 ns 217177.5 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 598458 ns 459916.5 ns 1.30
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 604375.5 ns 465375 ns 1.30
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 610292 ns 498229 ns 1.22
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 609166.5 ns 449000 ns 1.36
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 450153 ns 242456 ns 1.86
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19947300 ns 33776271 ns 0.59
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9553437.5 ns 9662458 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 984880.5 ns 842873 ns 1.17
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 648625 ns 647916 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 647375 ns 650791.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 617542 ns 652979 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 636312 ns 664625 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 204431.5 ns 206160 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8163653 ns 8473221 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1348000 ns 1347124.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 232182 ns 237013 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2233874.5 ns 2259250 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2235250 ns 2232542 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2243542 ns 2224833 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2238083.5 ns 2241083 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 956768 ns 980993 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49121505 ns 46835859 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7001041 ns 7206958 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1380304 ns 1391854 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19792 ns 20083.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19833 ns 20916.5 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21896 ns 22625 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20833 ns 21291.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 111733 ns 113434.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3625769.5 ns 3471029 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1464666.5 ns 1349084 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 77731 ns 75101 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 231708.5 ns 220895.5 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219791.5 ns 228042 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220833 ns 238875 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 242250 ns 219500 ns 1.10
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 722782 ns 734488 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26022488 ns 26435758 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7764979.5 ns 7569709 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 566586 ns 566315 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 541 ns 583 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 666 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 22661 ns 23420 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1248244 ns 1232395 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 419333 ns 304916.5 ns 1.38
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 50031 ns 49271 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 142416 ns 9166.5 ns 15.54
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 137875 ns 9833 ns 14.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 142583 ns 10833 ns 13.16
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 138500 ns 9896 ns 14.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 445482 ns 263097 ns 1.69
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 15914244 ns 24623499 ns 0.65
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5263708 ns 5512291 ns 0.95
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 546865 ns 408514 ns 1.34
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10000 ns 10083 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8792 ns 8209 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10709 ns 10542 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9416 ns 10541 ns 0.89
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 118544.5 ns 118922 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3536986 ns 3616112 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 813750 ns 828896 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 71570 ns 72900.5 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 7500 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7792 ns 7333 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8000 ns 7959 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7542 ns 7667 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 494002 ns 513625 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17713765 ns 17225102 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 3925291.5 ns 3863708 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 335028.5 ns 334143 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1584 ns 1458.5 ns 1.09
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1708 ns 1541 ns 1.11
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2125 ns 2000 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1459 ns 1354.5 ns 1.08
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 20784 ns 20983 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1154727.5 ns 1163032 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 294792 ns 299208 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 195392 ns 188466.5 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3334 ns 3333 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3291 ns 3375 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3500 ns 3458 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3292 ns 3334 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 218874.5 ns 222731 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10553778.5 ns 10495351.5 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1550604.5 ns 1564958 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 595936 ns 593390.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 147416.5 ns 149500 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 127875 ns 128375 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 130167 ns 129542 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 225084 ns 225104 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 23928 ns 24535 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1233082 ns 1174771 ns 1.05
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 272167 ns 264416 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 34770 ns 36880 ns 0.94
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 159708 ns 159541 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 127375 ns 138583 ns 0.92
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 110458 ns 138979.5 ns 0.79
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 284500 ns 266500 ns 1.07
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 214034 ns 218895 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10830578 ns 10544510 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2013958.5 ns 1984187.5 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 240127.5 ns 220122.5 ns 1.09
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7291 ns 7292 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 6041 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5959 ns 6000 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 10208 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32360 ns 33604 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1256519 ns 1212259 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 343458 ns 349062.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50810 ns 52470 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 384687.5 ns 256959 ns 1.50
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 399708 ns 230729 ns 1.73
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 388770.5 ns 238125 ns 1.63
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 399666 ns 223021 ns 1.79
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 526360 ns 258353 ns 2.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 29342862 ns 29336585 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8734166.5 ns 8308500 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 685707 ns 529375 ns 1.30
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 14500 ns 15979.5 ns 0.91
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15292 ns 14667 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 17084 ns 16895.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 14666 ns 15625 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 138947.5 ns 140165.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5806482 ns 5484385 ns 1.06
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 722000 ns 722958.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 238282 ns 238723 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23667 ns 23250 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23916 ns 23479 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23958 ns 24125 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23125 ns 23604 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 853569.5 ns 877758 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 40236184 ns 39144107 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5415250 ns 5343687.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 693447 ns 692857 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9584 ns 9375 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9416 ns 9084 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12687.5 ns 10979.5 ns 1.16
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9500 ns 9396 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 121436.5 ns 122868.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3573217 ns 3589561 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 809542 ns 743916 ns 1.09
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 73381 ns 70681 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13334 ns 14583 ns 0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14208 ns 13709 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14438 ns 14125 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13708.5 ns 13083 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 655124 ns 673546 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 21439559 ns 21471684.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5342125 ns 5194375 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 370548.5 ns 370953 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9250 ns 8833.5 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8959 ns 8917 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12042 ns 11500 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9375 ns 9875 ns 0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 120539 ns 121948.5 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3310613 ns 4256003 ns 0.78
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 875959 ns 844125 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 70721 ns 69331 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12709 ns 12729 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12958 ns 12416 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13125 ns 12792 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12541 ns 12604.5 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 540883.5 ns 557458 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18769186 ns 20700383 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4285771 ns 4280750 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 342373 ns 345693 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 30417 ns 30875.5 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 33937.5 ns 34167 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 29917 ns 31542 ns 0.95
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1792 ns 1792 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/CUDA 15983 ns 16044 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 79550 ns 74080 ns 1.07
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5333.5 ns 5375 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5188 ns 5084 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5417 ns 5459 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6666 ns 6833.5 ns 0.98
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 137115 ns 140705.5 ns 0.97
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 386874 ns 367554 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 250 ns 250 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 25197 ns 26129 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1365775.5 ns 1277003 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 399583 ns 290458 ns 1.38
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 50531 ns 48125.5 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 143958 ns 7125 ns 20.20
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 141958 ns 6833 ns 20.78
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 145771 ns 7542 ns 19.33
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 141666.5 ns 7041 ns 20.12
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 405976 ns 192022.5 ns 2.11
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 24082892 ns 22542195.5 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5201396 ns 4967750 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 549066 ns 392884 ns 1.40
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 1958 ns 2000 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2083 ns 1958 ns 1.06
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2042 ns 2084 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 1958 ns 1959 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 25964 ns 27301 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1226664.5 ns 1311935 ns 0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 440479.5 ns 441312.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 210742.5 ns 207332 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 147458 ns 17271 ns 8.54
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 145583.5 ns 16729 ns 8.70
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 150146 ns 17521 ns 8.57
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 145208 ns 16896 ns 8.59
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 451191 ns 269988 ns 1.67
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 15538748.5 ns 25410174 ns 0.61
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5521833 ns 5897125 ns 0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 847204 ns 716817 ns 1.18
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 178770.5 ns 152500 ns 1.17
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 176292 ns 178084 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 153833 ns 176437 ns 0.87
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 150375 ns 173584 ns 0.87
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 199142.5 ns 203100 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7652852.5 ns 7890476 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1399625 ns 1351292 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 198312 ns 177192 ns 1.12
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1326625 ns 1312125 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1320083.5 ns 1329791 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1328187.5 ns 1323916 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1320541 ns 1330917 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 894206 ns 913760 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 46708540.5 ns 46233546.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6150333 ns 6497521 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1134836.5 ns 1120096 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25709 ns 24729 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25916 ns 24959 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27625 ns 27000 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25875 ns 25187.5 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 233752 ns 236630.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8257548 ns 7478324 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1008250 ns 992375 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 120881 ns 118741 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 158437.5 ns 118687.5 ns 1.33
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 182896 ns 118229 ns 1.55
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 118875 ns 119041 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 117416 ns 162729 ns 0.72
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1061825.5 ns 1083875 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 46743020 ns 46702412 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5873084 ns 6050937.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 609097 ns 602125 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 250 ns 291 ns 0.86
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 250 ns 1.16
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22335 ns 23359 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1169342 ns 1296199 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 440167 ns 438834 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 51530 ns 48171 ns 1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 138000 ns 7187.5 ns 19.20
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 135458 ns 7083 ns 19.12
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 139250 ns 7792 ns 17.87
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 134896 ns 7292 ns 18.50
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 393303 ns 198494 ns 1.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25179341 ns 24144346 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5137437.5 ns 5445583 ns 0.94
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 528076 ns 399424 ns 1.32
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6625 ns 6562 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6166 ns 5916 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7833 ns 7875 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5542 ns 7312.5 ns 0.76
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 148231 ns 151846 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5722106.5 ns 5647891 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 441646 ns 439083 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 238152 ns 235952 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9917 ns 9833.5 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10166 ns 9812.5 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10375 ns 10333 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10042 ns 10000 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 888256.5 ns 914543.5 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 41828927 ns 41915714.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5636292 ns 5632541.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 678427 ns 676221 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 666 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 666 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 666 ns 667 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 666 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22088 ns 22836 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2108789.5 ns 2080700 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 222959 ns 220167 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 215243 ns 216072 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4542 ns 4625 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4584 ns 4500 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4791 ns 4833 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4542 ns 4625 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 222119.5 ns 226643.5 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9884248 ns 10356297.5 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1551583 ns 1566604.5 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 600866 ns 602506 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8292 ns 7833.5 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8167 ns 7833 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10167 ns 10187.5 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7917 ns 8375 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 119558 ns 121209 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3445450 ns 3579394 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 667562.5 ns 718583.5 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 69441 ns 68421 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8209 ns 8500 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8833 ns 8250 ns 1.07
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8959 ns 8895.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8208 ns 8250 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 581371.5 ns 596725 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 20320168.5 ns 21444108.5 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 4316250 ns 4307083 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 349214 ns 349143.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126917 ns 126334 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 132875 ns 129959 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 130583 ns 130375 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 183042 ns 183500 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 45635.5 ns 46170 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 101126 ns 98026 ns 1.03
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 338166.5 ns 339041 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 346812.5 ns 314708.5 ns 1.10
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 313875 ns 331646 ns 0.95
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 583145.5 ns 568833.5 ns 1.03
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 188975 ns 193689 ns 0.98
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 486375 ns 486014.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397541 ns 396875 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288312.5 ns 288542 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288000 ns 288416 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756125 ns 756291 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43307.5 ns 43814 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1369209 ns 1380119 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 417000 ns 406708 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 83400 ns 83471 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1456666.5 ns 1458250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1135812 ns 1132042 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1136958 ns 1134667 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2446667 ns 2445396 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 242723 ns 253213.5 ns 0.96
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 11709621 ns 11703870 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1775792 ns 1847229.5 ns 0.96
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 355854 ns 352553 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 644646 ns 649083 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 657333.5 ns 647625 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 642583.5 ns 653666 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 611437.5 ns 653416 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 198699.5 ns 195489.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8946325 ns 8194184 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1346479.5 ns 1355708 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 257148 ns 256777.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2437416 ns 2416000 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2444333 ns 2443646 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2459917 ns 2452084 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2446646 ns 2458000 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 975703 ns 1002651.5 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 53584726.5 ns 52361708 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7227375 ns 7303334 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1475180.5 ns 1493424.5 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 32917 ns 32291.5 ns 1.02
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 34958 ns 37125 ns 0.94
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34750 ns 34959 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 875 ns 958 ns 0.91
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15505 ns 15494 ns 1.00
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 73541 ns 78911 ns 0.93
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3167 ns 3125 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3271 ns 3084 ns 1.06
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3375 ns 3333 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3041 ns 3083.5 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 135878 ns 140049.5 ns 0.97
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 341823 ns 357403 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 406167 ns 406729.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 408250 ns 409167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 408750 ns 408916 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 420042 ns 421291.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 42722 ns 43852 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1392129 ns 1426671 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1132750 ns 1145916 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 240492.5 ns 242907 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4010917 ns 3884583 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4140708 ns 3997395.5 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4151999.5 ns 3997583.5 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3932958.5 ns 3773396 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 429590.5 ns 238021 ns 1.80
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 40011947.5 ns 37654037 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11642250 ns 11658833 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1280768 ns 1239152 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3916 ns 3958 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3917 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33393 ns 33239.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1245305.5 ns 1292975 ns 0.96
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 176917 ns 180000 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 41210 ns 40870 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15708 ns 15709 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 16000 ns 15666 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15959 ns 15958 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15583 ns 15708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 251876 ns 255521 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 8767093.5 ns 9126745 ns 0.96
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 864834 ns 866125 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 169502 ns 168612 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404584 ns 404125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295542 ns 295459 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 295458 ns 295917 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760375 ns 760584 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113075 ns 113603 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 999782 ns 1047585 ns 0.95
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 389917 ns 403542 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 89371 ns 89361 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1490875 ns 1484917 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1158708 ns 1156042 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1162750 ns 1159666 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2503542 ns 2467083.5 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 233635 ns 253490 ns 0.92
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 12058492.5 ns 11945518 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1844333 ns 1868708 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 357394 ns 357078 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 459 ns 458 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 25186 ns 26377 ns 0.95
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1362627.5 ns 1231417 ns 1.11
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 285833 ns 289375 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 210112 ns 208532 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 139125.5 ns 8083 ns 17.21
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 136708 ns 8084 ns 16.91
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 140166 ns 8916 ns 15.72
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 135875 ns 8292 ns 16.39
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 399319 ns 205980.5 ns 1.94
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25498054 ns 25622754 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5342541.5 ns 5187020.5 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 816958 ns 699842 ns 1.17
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 832125 ns 831083.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 616208 ns 618375 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 621417 ns 620896 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1542583 ns 1545833 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 132827 ns 129305 ns 1.03
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 170262 ns 168801 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2690792 ns 2685437.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1998875.5 ns 1995875.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 2004042 ns 2004125 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4930500 ns 4920541.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 239311 ns 242604 ns 0.99
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 865799 ns 856188.5 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 334 ns 292 ns 1.14
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31389 ns 32780 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1262909 ns 1223978 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 299167 ns 272646 ns 1.10
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 49080.5 ns 48940 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 137834 ns 7042 ns 19.57
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 134834 ns 6959 ns 19.38
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 138834 ns 7750 ns 17.91
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 135000 ns 7000 ns 19.29
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 434137 ns 221584 ns 1.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21203255 ns 21530894 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4873458 ns 4625916 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 507435 ns 370743 ns 1.37
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2432250 ns 2406167 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2409167 ns 2385438 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2387833 ns 2416791.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2396292 ns 2423834 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189831 ns 195765 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8179796 ns 7988442 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1483000 ns 1576520.5 ns 0.94
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 362144 ns 358903 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4633541.5 ns 4642937.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4653062.5 ns 4645584 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4675166.5 ns 4665146 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4555521 ns 4650250 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 886700 ns 908920 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48864880.5 ns 47384300 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6240563 ns 6194875 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1267073 ns 1416134 ns 0.89
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7208 ns 6812.5 ns 1.06
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 14083.5 ns 7187 ns 1.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7500 ns 8125 ns 0.92
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6916 ns 6979 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23218 ns 23336 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1159818.5 ns 1143574.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 262937.5 ns 254646 ns 1.03
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 40390 ns 33330 ns 1.21
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 51479 ns 50833.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 48791.5 ns 48375 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 50416.5 ns 64916.5 ns 0.78
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 47000 ns 63979.5 ns 0.73
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 214489 ns 218640 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10649119.5 ns 10406518.5 ns 1.02
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2003958.5 ns 2003333 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 223847 ns 237612 ns 0.94
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 21459 ns 21500 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 25062.5 ns 26146 ns 0.96
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 24375 ns 25583 ns 0.95
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 6000 ns 5250 ns 1.14
batchedmm(2, Bsize=512)/forward/GPU/CUDA 16873 ns 16818.5 ns 1.00
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 86560 ns 85311 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 11917 ns 12167 ns 0.98
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10416 ns 10250 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10417 ns 10792 ns 0.97
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 17750 ns 17916.5 ns 0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 225550 ns 229003 ns 0.98
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 374944 ns 374873 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406166 ns 406333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 297084 ns 297083 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296875 ns 297291 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762250 ns 762375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46316 ns 46714 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1373094 ns 1399660 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 410792 ns 487958 ns 0.84
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 89721 ns 91141 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1485562.5 ns 1491333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1165333 ns 1167145.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1166208 ns 1166417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2514500 ns 2472271 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 287030.5 ns 283154 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 11586937 ns 13584197 ns 0.85
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2073208 ns 2090083.5 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 379943 ns 380044 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 434375 ns 434000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 436958 ns 437208 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 436708 ns 436875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 447167 ns 446459 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 53910.5 ns 55157 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1015096 ns 1013588 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1130542 ns 1079041.5 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 237292 ns 236567.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4070791.5 ns 3895271 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4204896 ns 3933625.5 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4209333 ns 4028229.5 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3891208 ns 3807292 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 512126 ns 259800 ns 1.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31246857.5 ns 36848551 ns 0.85
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10831209 ns 10417062.5 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1294643 ns 1238382.5 ns 1.05
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8791 ns 8708 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 7709 ns 7625 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 7667 ns 7666 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12375 ns 12375 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23822 ns 24111 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2157287 ns 2073642 ns 1.04
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 224666.5 ns 222645.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 217992 ns 216692 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45375 ns 45375 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45334 ns 45125 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45250 ns 45250 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45209 ns 45292 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 342231 ns 347040 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 13553832.5 ns 13931322.5 ns 0.97
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1724458.5 ns 1692667 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 676147 ns 670407 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 123500 ns 89750 ns 1.38
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 90167 ns 147041 ns 0.61
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 89500 ns 124792 ns 0.72
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80833 ns 126979.5 ns 0.64
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190984 ns 189741 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5773034.5 ns 5754517.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1967042 ns 1948042 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 186732 ns 184172 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2013292 ns 2022000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2014792 ns 2017708 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2023604.5 ns 2021875 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1964083 ns 2017979.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 532292 ns 539246.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31086820 ns 28808272 ns 1.08
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9598208.5 ns 9259396 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 978310 ns 1110411 ns 0.88

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal merged commit b9686e4 into main Aug 29, 2024
66 of 72 checks passed
@avik-pal avik-pal deleted the ap/warn_no_train branch August 29, 2024 23:55