This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

test: add tests comparing the fused op with unfused op #157

Merged 1 commit on Sep 10, 2024


avik-pal
Member

No description provided.

Contributor

github-actions bot left a comment

LuxLib Benchmarks

Benchmark suite Current: f11e57d Previous: 40d9192 Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6750 ns 5666 ns 1.19
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5958 ns 7459 ns 0.80
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7250 ns 8458 ns 0.86
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5750 ns 7291 ns 0.79
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 119059 ns 119078 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2735279 ns 2538616 ns 1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 772709 ns 702792 ns 1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 591165 ns 427074 ns 1.38
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9875.5 ns 10020.5 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9917 ns 9750 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10500 ns 10250 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9708 ns 9895.5 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 543283 ns 551531 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 17381908 ns 18148603 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2589875 ns 2222000 ns 1.17
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 890074 ns 679576 ns 1.31
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 8437.5 ns 1271 ns 6.64
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 8083 ns 2729 ns 2.96
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 8792 ns 1708.5 ns 5.15
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 8042 ns 1708.5 ns 4.71
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 23722 ns 21712 ns 1.09
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1332230 ns 1291875 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 214125 ns 183666 ns 1.17
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 30537 ns 31345.5 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 5250 ns 3500 ns 1.50
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4917 ns 3333 ns 1.48
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4833 ns 4208.5 ns 1.15
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4959 ns 4375 ns 1.13
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 148252.5 ns 146456.5 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 9011509.5 ns 8037303.5 ns 1.12
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1295209 ns 1510917 ns 0.86
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 140623 ns 146682 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58500 ns 56500 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46937.5 ns 46875 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46875 ns 46833 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84250 ns 83459 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39503 ns 36990 ns 1.07
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 620600 ns 664843 ns 0.93
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1515416.5 ns 1340625 ns 1.13
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 73577 ns 80736 ns 0.91
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2011875 ns 2031000 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2063813 ns 2086333.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2090208 ns 2089292 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2004354 ns 1995354 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 222303 ns 232927.5 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 8058610 ns 7734526 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7562916 ns 4323958 ns 1.75
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1047899 ns 1581446 ns 0.66
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 173479.5 ns 147042 ns 1.18
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 176187.5 ns 144625 ns 1.22
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 155396 ns 149833 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147500 ns 151895.5 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165814 ns 166087 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7631325.5 ns 7754863 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1699333.5 ns 1479250 ns 1.15
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 159930 ns 198942 ns 0.80
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1118083 ns 1120063 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1103083 ns 1117666 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1108333 ns 1115750 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1119646 ns 1124875 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 701220.5 ns 721156.5 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 37668765 ns 33562933.5 ns 1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6581625 ns 6149062.5 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 911594 ns 1022579 ns 0.89
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4333 ns 4166 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4646 ns 5041.5 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5729 ns 6042 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4312.5 ns 6250 ns 0.69
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 92104.5 ns 95202.5 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5286174 ns 5313078 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 597770.5 ns 416333.5 ns 1.44
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 60844 ns 65661 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8834 ns 9000 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8917 ns 8709 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9292 ns 9375 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8583 ns 8417 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 597427 ns 618225 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 35089178 ns 31699887 ns 1.11
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5972625 ns 5433375 ns 1.10
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 373409 ns 388724 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17958.5 ns 16229.5 ns 1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18208.5 ns 17500 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21958 ns 21916 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17500 ns 18542 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 66685 ns 68340 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3162792 ns 3114761 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1280875 ns 455354.5 ns 2.81
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 75982 ns 75821 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222292 ns 213125 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 225458 ns 212125 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215042 ns 214749.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218916.5 ns 223791 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 355209 ns 361191 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 13835720 ns 13957207 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5582750 ns 5399125 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 424704 ns 468614 ns 0.91
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 10541.5 ns 625 ns 16.87
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 8667 ns 667 ns 12.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 10167 ns 875 ns 11.62
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 7812.5 ns 708 ns 11.03
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 22860 ns 20782 ns 1.10
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1259883 ns 1176905 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 303958 ns 179000 ns 1.70
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 28744 ns 31201 ns 0.92
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2625 ns 1458 ns 1.80
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2333 ns 1500 ns 1.56
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2375 ns 1541 ns 1.54
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2209 ns 1333.5 ns 1.66
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 127050.5 ns 128010.5 ns 0.99
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 9494797.5 ns 9057994 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1576500 ns 1474521 ns 1.07
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 117880.5 ns 136491 ns 0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 14834 ns 7333 ns 2.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 13333 ns 6166 ns 2.16
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 14375 ns 6166 ns 2.33
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16542 ns 10291 ns 1.61
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33557 ns 24318 ns 1.38
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1327266 ns 1193537 ns 1.11
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 572917 ns 341583 ns 1.68
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 64270 ns 47631 ns 1.35
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 239208 ns 231125 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 278959 ns 270583 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 271521.5 ns 270375 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 257834 ns 213167 ns 1.21
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 201499.5 ns 195209.5 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31139677 ns 31467862 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9587292 ns 9233666 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 601625 ns 645516 ns 0.93
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4167 ns 4125 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4084 ns 4125 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4209 ns 4125 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4125 ns 4125 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 22857 ns 23938.5 ns 0.95
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 2143984 ns 2014824 ns 1.06
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 223771 ns 210750 ns 1.06
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 42640 ns 48021 ns 0.89
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 21667 ns 16916 ns 1.28
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 21542 ns 17417 ns 1.24
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 21833 ns 17208 ns 1.27
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 20750 ns 16667 ns 1.24
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 205962 ns 198962 ns 1.04
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 11331316 ns 10294946 ns 1.10
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 947084 ns 900625 ns 1.05
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 180478 ns 172967 ns 1.04
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 512771 ns 508125 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 405875 ns 404416 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 406583 ns 404792 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 865042 ns 865375 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113394 ns 113291 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 411956 ns 429336 ns 0.96
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 437042 ns 432708 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 396972 ns 242113 ns 1.64
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2313666.5 ns 2329437 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2033125 ns 2034750 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2042917 ns 2031750 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3198833 ns 3193375 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 255204.5 ns 246406 ns 1.04
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11104817 ns 12521873.5 ns 0.89
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1952000 ns 1893250 ns 1.03
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 608372.5 ns 744268 ns 0.82
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6979 ns 5187.5 ns 1.35
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6562.5 ns 7083 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7875 ns 7354 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6771 ns 7542 ns 0.90
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 87915.5 ns 93165 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5601042.5 ns 5491281 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 765292 ns 752833 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 60874 ns 65211 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11208 ns 12167 ns 0.92
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12500 ns 11792 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12375 ns 12374.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10958 ns 11396 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 612766 ns 647871 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38641140 ns 39284056 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5719500 ns 5190667 ns 1.10
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 393075.5 ns 411409 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 3125 ns 500 ns 6.25
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 2917 ns 541 ns 5.39
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 3084 ns 500 ns 6.17
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 2708 ns 500 ns 5.42
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 30967 ns 23724 ns 1.31
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2285306 ns 2212056 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 231333 ns 204584 ns 1.13
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 48601 ns 47141 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 12583 ns 2125 ns 5.92
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 11875 ns 2125 ns 5.59
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 11834 ns 2167 ns 5.46
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 10584 ns 2125 ns 4.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 241976 ns 227021 ns 1.07
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 12085430.5 ns 11087876.5 ns 1.09
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 1942916.5 ns 1921834 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 190887 ns 172882 ns 1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 30125 ns 8208 ns 3.67
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 30250 ns 9146 ns 3.31
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 35999.5 ns 9959 ns 3.61
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 27500 ns 8375 ns 3.28
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 107239.5 ns 104776 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3168249.5 ns 3291769.5 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 906417 ns 468500 ns 1.93
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 78697 ns 72700.5 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25312.5 ns 17374.5 ns 1.46
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 24334 ns 18625 ns 1.31
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 25062 ns 18250 ns 1.37
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 24354 ns 18125 ns 1.34
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 574746 ns 580515 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 19314204 ns 17620571 ns 1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5293792 ns 4970938 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 372506.5 ns 381279 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2167 ns 459 ns 4.72
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 1959 ns 584 ns 3.35
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2125 ns 625 ns 3.40
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 1791 ns 458 ns 3.91
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 40056 ns 35839 ns 1.12
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1228474 ns 1218575 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 289687.5 ns 423541 ns 0.68
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 45445 ns 46311 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11792 ns 9104 ns 1.30
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11917 ns 9333 ns 1.28
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11520.5 ns 9083 ns 1.27
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11000 ns 9208 ns 1.19
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 258486.5 ns 261166 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18348508 ns 18752145 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4878041.5 ns 4335125 ns 1.13
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 364537 ns 367929 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396667 ns 395708 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287229 ns 288375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288479.5 ns 288375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756334 ns 756292 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111990 ns 111964.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 336277 ns 329610 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 367729.5 ns 303771 ns 1.21
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 77224 ns 75611 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1472083.5 ns 1445541 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1139895.5 ns 1129292 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1142896 ns 1133875 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2361791.5 ns 2356333 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 221976.5 ns 210839 ns 1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 11123008 ns 10091107 ns 1.10
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1662645.5 ns 1639416 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 305927.5 ns 322414 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7125 ns 7042 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7770.5 ns 8000 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8520.5 ns 8833.5 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7792 ns 7520.5 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 138464.5 ns 142989 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5761431 ns 5929780 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 724416 ns 470791.5 ns 1.54
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 63489 ns 66011 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14417 ns 16208 ns 0.89
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15229 ns 14250 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15167 ns 16000 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15042 ns 15354.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 957624 ns 963872.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 44047156 ns 42665593.5 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5683459 ns 5541125 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 414450 ns 426829 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 26208 ns 24458 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 26208.5 ns 26062.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 30041 ns 29916.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25458 ns 25708.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 197573 ns 202495.5 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7876804.5 ns 8124671 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 975021 ns 985584 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 157760 ns 114461 ns 1.38
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 105062.5 ns 109083 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 150834 ns 152250 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 112167 ns 152854 ns 0.73
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 146375 ns 142750 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1069177 ns 1066908 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42327752 ns 41393438 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5806479 ns 5472042 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 553606 ns 588251 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 81687.5 ns 75167 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 78792 ns 74583 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 80771 ns 84375 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74979.5 ns 74125 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 204985 ns 208606 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8119657 ns 7473638 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 533312.5 ns 500875 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 124813.5 ns 129022 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 251458 ns 304417 ns 0.83
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 298520.5 ns 302145.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 300354.5 ns 267604 ns 1.12
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 318041 ns 221146.5 ns 1.44
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1109883 ns 1119561.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44130956.5 ns 40462234 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6429916 ns 6061271 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 642250 ns 695387 ns 0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 17125.5 ns 15729.5 ns 1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 16916 ns 17541 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 18354.5 ns 18000 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 17187.5 ns 17000 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 144430.5 ns 148248.5 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 6119082 ns 5730909.5 ns 1.07
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 583667 ns 745333 ns 0.78
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 225772 ns 232902 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25292 ns 26937 ns 0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27708.5 ns 26291.5 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27792 ns 27291 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 28167 ns 26833.5 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 969396 ns 995021 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41222701.5 ns 39941943 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5774021 ns 5463292 ns 1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 641800 ns 692327 ns 0.93
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 37375 ns 10375 ns 3.60
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 33937.5 ns 11875 ns 2.86
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 43000 ns 12562 ns 3.42
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 30521 ns 11625 ns 2.63
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 138411.5 ns 125968 ns 1.10
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3833959.5 ns 3534875 ns 1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 850833 ns 849958 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 247072 ns 236132 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 24291 ns 22292 ns 1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 24125 ns 21542 ns 1.12
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 25458 ns 23416 ns 1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 23750 ns 22459 ns 1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 701835.5 ns 709781 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 21171059 ns 21081902.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5379104 ns 5312812.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 578882 ns 671626 ns 0.86
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 64104.5 ns 63000 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 64208 ns 64875 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 67187.5 ns 67624.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 65667 ns 70792 ns 0.93
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 105929.5 ns 108732 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3410310 ns 3570568 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1311250 ns 463166.5 ns 2.83
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 238446 ns 233653 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 442604 ns 437250 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 485687 ns 448250 ns 1.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 444375 ns 451208 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 450229 ns 443667 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 511443 ns 523839.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 21511920 ns 20377781.5 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5894125 ns 6056791 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 645166 ns 715783 ns 0.90
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7625 ns 7104.5 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7479.5 ns 8125 ns 0.92
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8750 ns 8333 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7417 ns 7729.5 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 143471 ns 147799 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5455985 ns 5614298 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 694542 ns 704750 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 61425 ns 65321 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13708 ns 14500 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13812.5 ns 15437.5 ns 0.89
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15042 ns 14833 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15834 ns 14146 ns 1.12
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 936407 ns 966324 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 39926329 ns 36660688 ns 1.09
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5583959 ns 5256874.5 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 394929 ns 400984 ns 0.98
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6153667 ns 6153708 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6372917 ns 6380458 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6368479 ns 6380979.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11917041 ns 11947959 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 350729 ns 301662 ns 1.16
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 394713.5 ns 322583 ns 1.22
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19134500 ns 19056521 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19938458 ns 19941000 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19938625 ns 19981146 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36508062.5 ns 36490833.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1138227 ns 1026590 ns 1.11
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1139856 ns 1153502 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 4041 ns 917 ns 4.41
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 3416 ns 959 ns 3.56
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 3750 ns 1000 ns 3.75
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 3292 ns 958 ns 3.44
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 30429 ns 23570 ns 1.29
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2237968 ns 2101433 ns 1.06
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 233792 ns 203000 ns 1.15
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 209867.5 ns 207632 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 11208 ns 3708 ns 3.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 12167 ns 3791 ns 3.21
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 13041 ns 3792 ns 3.44
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 11375 ns 3750 ns 3.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 313879 ns 284692.5 ns 1.10
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11584761 ns 11502827.5 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2157229.5 ns 2063354 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 488504 ns 625846 ns 0.78
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 25999.5 ns 7208 ns 3.61
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 33479 ns 8500 ns 3.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 43250 ns 9292 ns 4.65
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 26333 ns 8250 ns 3.19
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 137461 ns 122668.5 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3438795 ns 3715127.5 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 799584 ns 787166 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 80640 ns 72740 ns 1.11
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 17354 ns 11875 ns 1.46
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 19291 ns 12750 ns 1.51
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 19875 ns 12583 ns 1.58
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 18125 ns 12500 ns 1.45
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 666548 ns 651999 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 20922356.5 ns 22144306 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 4784875 ns 4276208 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 359342.5 ns 359014 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 334 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22472 ns 22720.5 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2158627.5 ns 2075647.5 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 227854.5 ns 205083 ns 1.11
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 42420 ns 47440 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 6209 ns 2875 ns 2.16
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 7209 ns 3500 ns 2.06
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 7792 ns 3333 ns 2.34
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 6584 ns 3208 ns 2.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 220697 ns 206663 ns 1.07
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9979049 ns 9232071 ns 1.08
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1643104.5 ns 1552875 ns 1.06
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 160089.5 ns 156172 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12354.5 ns 10083 ns 1.23
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11791.5 ns 11083 ns 1.06
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13312.5 ns 12458 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10958 ns 11708 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 123366 ns 123476 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3431504 ns 3456473.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 889208 ns 861479.5 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 234880 ns 236062 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20542 ns 20604 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21333 ns 23187.5 ns 0.92
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21750 ns 23333 ns 0.93
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21167 ns 21042 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 598497.5 ns 607311 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19557859 ns 20290582.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4742292 ns 4254667 ns 1.11
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 574605 ns 645431.5 ns 0.89
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 7125 ns 4458 ns 1.60
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 6917 ns 4500 ns 1.54
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 7375 ns 4417 ns 1.67
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 6833 ns 4500 ns 1.52
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 32096 ns 24732 ns 1.30
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2367486 ns 2177168 ns 1.09
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 223083.5 ns 211459 ns 1.05
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 52959 ns 47591 ns 1.11
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 25667 ns 16375 ns 1.57
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 27542 ns 16834 ns 1.64
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 27500 ns 16458 ns 1.67
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 25833 ns 16083 ns 1.61
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 359637.5 ns 332546.5 ns 1.08
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 13316554 ns 12988178 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1347791 ns 1511750 ns 0.89
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 213669 ns 208322 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 3375 ns 2084 ns 1.62
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 3500 ns 2041 ns 1.71
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 3708 ns 2167 ns 1.71
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3417 ns 2209 ns 1.55
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 41395 ns 36551 ns 1.13
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1212620.5 ns 1147028 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 296083 ns 268042 ns 1.10
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 205695 ns 204212 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 18291.5 ns 17396 ns 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 19041.5 ns 17250 ns 1.10
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 20354 ns 17812.5 ns 1.14
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20791 ns 19479 ns 1.07
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 292894 ns 297836 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20954498 ns 21470855.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5026958 ns 5022375 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 614799.5 ns 686617 ns 0.90
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 60167 ns 56395.5 ns 1.07
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 65750 ns 65083 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 65458 ns 66250 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 53646 ns 51333 ns 1.05
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66510 ns 66767.5 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 108798.5 ns 115211 ns 0.94
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 188417 ns 197187.5 ns 0.96
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 113292 ns 163417 ns 0.69
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 126854 ns 163937.5 ns 0.77
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 305229 ns 315500 ns 0.97
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 220047 ns 219712.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 527527 ns 611147 ns 0.86
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 143208 ns 105333 ns 1.36
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 83187.5 ns 81834 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 84875 ns 86959 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82416.5 ns 86750 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190604 ns 191740.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5436076.5 ns 5593567.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1857292 ns 2535645.5 ns 0.73
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 206837 ns 204172 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1890208 ns 1915521 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1880729.5 ns 1914333 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1667062.5 ns 1911750 ns 0.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1763312.5 ns 1879292 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 534577 ns 538609 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 26696643 ns 24792062.5 ns 1.08
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9225937.5 ns 8911395.5 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 879505 ns 1067201 ns 0.82
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 2625 ns 292 ns 8.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 2708 ns 292 ns 9.27
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 3084 ns 292 ns 10.56
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 2542 ns 333 ns 7.63
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 28896 ns 22127 ns 1.31
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2115362 ns 2111782 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 336500 ns 320417 ns 1.05
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 43972 ns 41970 ns 1.05
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 10125 ns 1792 ns 5.65
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 12083 ns 1875 ns 6.44
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 13584 ns 1875 ns 7.24
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 11541 ns 1875 ns 6.16
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 280701 ns 255417.5 ns 1.10
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 10783143 ns 10493115 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1321833 ns 1487041 ns 0.89
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 196998 ns 183032 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10187.5 ns 7375 ns 1.38
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 10208.5 ns 9562.5 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11084 ns 11250 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 10500 ns 11333 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 119766 ns 121634 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3544552 ns 3330370 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 859750.5 ns 831000 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 236352 ns 235863 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9021 ns 8958 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10167 ns 10917 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9708 ns 11542 ns 0.84
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9292 ns 9250 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 528594 ns 536196 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20308774 ns 20906072 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4428479.5 ns 3661104.5 ns 1.21
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 556110 ns 620146.5 ns 0.90
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 65208 ns 56833 ns 1.15
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 55625 ns 46333 ns 1.20
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 58792 ns 47000 ns 1.25
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 89875 ns 83417 ns 1.08
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 50195 ns 40185 ns 1.25
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1487606.5 ns 1391043 ns 1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1138750 ns 1150167 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 82043 ns 77886 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1861958 ns 1925959 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1984667 ns 1932875 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1968666.5 ns 1975666 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1901542 ns 1853417 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 235593 ns 224336 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 34412214 ns 33169959 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11173167 ns 11254125 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1019646 ns 1176553 ns 0.87
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 419250.5 ns 416209 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 418458 ns 418021.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 421792 ns 423500 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 418166 ns 417709 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 211672.5 ns 212391.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8780508 ns 7928224 ns 1.11
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 532833 ns 501042 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 269003 ns 283733 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 670000 ns 689875.5 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 683417 ns 744770.5 ns 0.92
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 674958 ns 684250 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 674541 ns 683020.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1054744 ns 1071393 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45045495 ns 45538634 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6450000 ns 6134687.5 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 862543 ns 911264.5 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3451396 ns 3426041.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3392854.5 ns 3415458.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3412042 ns 3440084 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3434750 ns 3459083 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 172559 ns 174794 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8346126 ns 8045126 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1390000 ns 1391250 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 666837 ns 426850 ns 1.56
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6190917 ns 6168667 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6183937.5 ns 6210416 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6169479.5 ns 6205709 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6221625 ns 6247562.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1001720 ns 1017240 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48855796 ns 50293396 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7250416 ns 7732791.5 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1506478.5 ns 1542501 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 474333 ns 473291 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 343083 ns 342875 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 344062.5 ns 341396 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 906521 ns 901791 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 54961 ns 46836 ns 1.17
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 405228 ns 381391 ns 1.06
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 420833.5 ns 354270.5 ns 1.19
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 406260 ns 243143 ns 1.67
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2312792 ns 2332208 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2037666 ns 2034354.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2040458 ns 2036500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3206833 ns 3194416 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 303345 ns 273644.5 ns 1.11
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 15525931 ns 15628377 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2177250 ns 2136645.5 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 644946 ns 772838 ns 0.83
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 63292 ns 56292 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 55084 ns 45834 ns 1.20
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 52812.5 ns 46125 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 90084 ns 83209 ns 1.08
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38580 ns 28601 ns 1.35
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1404175.5 ns 1335147 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1161541 ns 1124979 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 82675 ns 74305.5 ns 1.11
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1921959 ns 2016104.5 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2011209 ns 2087291 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1782958 ns 2087917 ns 0.85
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2014334 ns 1975958.5 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 248887 ns 240545 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 38486330 ns 37474096 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11361792 ns 11883709 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1052699 ns 1048951 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58250 ns 56542 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 48250 ns 46354.5 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 48125 ns 46666.5 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84709 ns 83750 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 55882 ns 50752 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 833022 ns 835807 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1106333 ns 1048667 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 73938 ns 78556 ns 0.94
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1895166 ns 1921000 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1977250 ns 1952958.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1934667 ns 1973000 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1896166.5 ns 1862417 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 249033 ns 246729 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 17897206 ns 16959227 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9818416 ns 9957875 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 961798.5 ns 1034211 ns 0.93
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1542 ns 292 ns 5.28
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1875 ns 416 ns 4.51
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1542 ns 416 ns 3.71
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1458 ns 292 ns 4.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 40187 ns 35694 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1298014 ns 1211794.5 ns 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 436354 ns 311771 ns 1.40
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 47208 ns 46570 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8479.5 ns 6604.5 ns 1.28
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9083 ns 7291.5 ns 1.25
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8083 ns 6666 ns 1.21
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8000 ns 6709 ns 1.19
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 216055 ns 213644 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 21272220.5 ns 21642370 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5159042 ns 4349083.5 ns 1.19
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 361170.5 ns 366543.5 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32391 ns 32948 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1244615 ns 1191915 ns 1.04
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 251500 ns 153792 ns 1.64
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 36889 ns 39081 ns 0.94
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 6209 ns 3208 ns 1.94
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 7792 ns 3041 ns 2.56
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 6667 ns 3083 ns 2.16
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 6541 ns 3083 ns 2.12
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 206420 ns 193915 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 9181253 ns 7217530 ns 1.27
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 939167 ns 894250 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 154709 ns 158472 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 445041.5 ns 420583.5 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 455500 ns 420833.5 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 447667 ns 456166.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 448542 ns 426229 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 143863 ns 140216.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6251430 ns 6258248 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2006875 ns 2682604 ns 0.75
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 465741 ns 367294 ns 1.27
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3792292 ns 3811479 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3712500 ns 3798000 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3740645.5 ns 3806125 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3681542 ns 3813437.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 718109 ns 724543 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33048296.5 ns 32785400 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10430124.5 ns 10852833 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1140363 ns 1313993.5 ns 0.87
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49865562.5 ns 49807062.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35500000 ns 35521583 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35527834 ns 35517479 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97086500 ns 97112834 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1593681 ns 1611615 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1577160 ns 1049140 ns 1.50
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154458541.5 ns 153740041.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112337791.5 ns 112306083 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112343542 ns 112476667 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 295487729 ns 295356541 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6504453 ns 6485483 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5864859 ns 5555702 ns 1.06
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 21250 ns 15041.5 ns 1.41
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 21375 ns 18375 ns 1.16
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 20270.5 ns 16083 ns 1.26
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 23000 ns 15646 ns 1.47
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 22923.5 ns 21271 ns 1.08
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1127379 ns 1120492.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 223667 ns 200000 ns 1.12
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 28513 ns 27480 ns 1.04
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 11791 ns 10666.5 ns 1.11
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 10000 ns 9042 ns 1.11
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 10188 ns 9437.5 ns 1.08
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 18333.5 ns 17042 ns 1.08
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 261557 ns 267724 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 9750936.5 ns 10072145 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1567125 ns 1541750 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 141680 ns 148171 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 26958 ns 7709 ns 3.50
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 26041.5 ns 8709 ns 2.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 29833 ns 10708 ns 2.79
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 27917 ns 9708.5 ns 2.88
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 140387.5 ns 129031 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3651619.5 ns 3486446 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 803646 ns 797791 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 244217 ns 234732 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11584 ns 10458.5 ns 1.11
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11625 ns 9833 ns 1.18
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11541.5 ns 11333.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11416 ns 9125 ns 1.25
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 629099.5 ns 638866 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 23887109 ns 21816663 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 4918854 ns 4208187.5 ns 1.17
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 580556 ns 651461.5 ns 0.89
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9417 ns 8625.5 ns 1.09
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9979 ns 9729 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11583 ns 11521 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9334 ns 11042 ns 0.85
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 121984.5 ns 123974 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3546531 ns 3315044 ns 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 896062.5 ns 859750 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 71303 ns 72471 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 14187.5 ns 17583 ns 0.81
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13750 ns 13458 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13584 ns 15166 ns 0.90
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 14125 ns 13083 ns 1.08
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 595508 ns 608117 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21472632.5 ns 18976850.5 ns 1.13
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4571041 ns 3989167 ns 1.15
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 338302 ns 346933 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1791 ns 541 ns 3.31
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1750 ns 625 ns 2.80
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1791 ns 625 ns 2.87
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1667 ns 584 ns 2.85
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 40005 ns 35726 ns 1.12
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1233960 ns 1170850 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 275083.5 ns 255917 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 207819 ns 204512 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9500 ns 8604.5 ns 1.10
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9437.5 ns 7625 ns 1.24
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9500 ns 9250 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8958 ns 7584 ns 1.18
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 230455 ns 237837 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 23652437 ns 23133813.5 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4903187.5 ns 4454021 ns 1.10
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 597027 ns 654907 ns 0.91
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 17937 ns 12208 ns 1.47
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 18604 ns 16208 ns 1.15
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 18792 ns 15542 ns 1.21
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 18542 ns 10229 ns 1.81
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 24371 ns 22887 ns 1.06
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1228812.5 ns 1146280 ns 1.07
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 209292 ns 183250 ns 1.14
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 190396.5 ns 190602 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32875 ns 31917 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32958 ns 32334 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 33125 ns 32334 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32875 ns 31792 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 280116 ns 282370 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11861919 ns 12675054 ns 0.94
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1676666.5 ns 1664375 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 547659 ns 592261 ns 0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 441854.5 ns 445708 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 475312.5 ns 440416 ns 1.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 444000 ns 446125 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 443166 ns 462250 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194868 ns 194079.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6147280 ns 6009981 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1962250 ns 1948750 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 365368.5 ns 368473 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3714333 ns 3828708 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3747417 ns 3827249.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3455458.5 ns 3829459 ns 0.90
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3835083 ns 3834708 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 543251.5 ns 555671 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 29951646 ns 28291601.5 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8981208.5 ns 9332833 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1128911 ns 1362449 ns 0.83
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 831451229 ns 836902583.5 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 543434250 ns 545812333 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 542523083 ns 552742958 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1509475541 ns 1515431791 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22745955.5 ns 22773250.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 10453226 ns 14681704 ns 0.71
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3556189000 ns 3618929167 ns 0.98
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 2979880625 ns 1786520209 ns 1.67
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1804545917 ns 1811380625 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4761240750 ns 4749890834 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 370586856 ns 371829328 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 67278654 ns 89064682 ns 0.76
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 75875 ns 75813 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 77521 ns 76708 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79125 ns 79437 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 78709 ns 76979 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 209735 ns 213831.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7790930 ns 7889207 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 525125 ns 504291 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 141925.5 ns 107541 ns 1.32
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 193708 ns 268729 ns 0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 280750 ns 283625 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 197542 ns 204145.5 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 197375 ns 192875 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1052688.5 ns 1071904.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 47347171 ns 42887765 ns 1.10
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6146625 ns 5838812.5 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 618917 ns 632041 ns 0.98
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199650333.5 ns 199435500 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 139206333 ns 139086375 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139463625 ns 139238083 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 389189291 ns 389003125 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5834182 ns 5834940 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 2538622.5 ns 3577266 ns 0.71
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 619678062.5 ns 616747896 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 441556083 ns 438910291 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 439833958.5 ns 439344770.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1179733292 ns 1178749375 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26679227 ns 26592537.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 16136881 ns 22013573 ns 0.73
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 14208 ns 7292 ns 1.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 12792 ns 6291 ns 2.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 13292 ns 6250 ns 2.13
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16708 ns 9959 ns 1.68
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37547.5 ns 28590.5 ns 1.31
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1347839 ns 1242816 ns 1.08
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 632375 ns 342708 ns 1.85
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 65412.5 ns 46790 ns 1.40
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225229.5 ns 214875 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228229.5 ns 220542 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229375 ns 223250 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221792 ns 207000 ns 1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 235566.5 ns 227888 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 34095690 ns 32088566 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9069291.5 ns 9056958 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 583156 ns 532636 ns 1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10709 ns 7500 ns 1.43
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9667 ns 8459 ns 1.14
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10812.5 ns 11166 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 10000 ns 10125 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 119996.5 ns 120432.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3386801 ns 3400864 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 865208 ns 833917 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 70562 ns 69170 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8208 ns 11687 ns 0.70
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8250 ns 7875 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7834 ns 9083 ns 0.86
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8042 ns 7791.5 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 522508.5 ns 540200 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19977756 ns 19905821.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4494666 ns 3738000 ns 1.20
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 328765 ns 316443 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7208 ns 500 ns 14.42
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7084 ns 500 ns 14.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7167 ns 583 ns 12.29
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7125 ns 500 ns 14.25
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 35632 ns 26859 ns 1.33
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1285621 ns 1218948 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 313959 ns 487291.5 ns 0.64
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 53540 ns 46600 ns 1.15
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16833 ns 12042 ns 1.40
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16417 ns 9500 ns 1.73
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16812.5 ns 10666 ns 1.58
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17417 ns 9375 ns 1.86
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 267006.5 ns 259067.5 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 24406042 ns 22720833.5 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5634958 ns 5032208 ns 1.12
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 418588 ns 388914 ns 1.08
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 115145.5 ns 105209 ns 1.09
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 106812 ns 98958.5 ns 1.08
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 110417 ns 100666 ns 1.10
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 154333 ns 146584 ns 1.05
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 27538 ns 26010 ns 1.06
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1258504.5 ns 1202311.5 ns 1.05
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 264333.5 ns 239416 ns 1.10
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 210464 ns 191122 ns 1.10
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 519625 ns 478959 ns 1.08
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 480604 ns 490458 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 480083 ns 483458 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 479833.5 ns 519792 ns 0.92
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 236269 ns 238157 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11815219 ns 11712742 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2113625 ns 2063166.5 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 580731 ns 609226.5 ns 0.95
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5312.5 ns 5459 ns 0.97
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 6520.5 ns 6937.5 ns 0.94
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7375 ns 6708 ns 1.10
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 4417 ns 4479 ns 0.99
batchedmm(16, Bsize=32)/forward/GPU/CUDA 16222 ns 17171 ns 0.94
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 73207 ns 84830 ns 0.86
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 13167 ns 12709 ns 1.04
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 11688 ns 11208.5 ns 1.04
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 12000 ns 11979.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 17708 ns 16792 ns 1.05
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 219499 ns 219500 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 331220 ns 367374 ns 0.90
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 40250 ns 35250 ns 1.14
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51562.5 ns 51958 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52334 ns 53333 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13625 ns 13792 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 19967 ns 22473 ns 0.89
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 93615 ns 87211 ns 1.07
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 38042 ns 37208 ns 1.02
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 32479 ns 30979 ns 1.05
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 32479 ns 32729.5 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 58542 ns 57375 ns 1.02
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 198501 ns 198883 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 372867 ns 411165 ns 0.91
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 9708 ns 1708 ns 5.68
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 9791.5 ns 1917 ns 5.11
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 10667 ns 2208 ns 4.83
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 10792 ns 2020.5 ns 5.34
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 22678 ns 20890 ns 1.09
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1185948.5 ns 1182894 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 306541.5 ns 198895.5 ns 1.54
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 29910.5 ns 34491 ns 0.87
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 3208 ns 2250 ns 1.43
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 3208 ns 2125 ns 1.51
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 3208 ns 2541 ns 1.26
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 3250 ns 2375 ns 1.37
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 206737.5 ns 209350.5 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 8767043 ns 9223044 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1506229.5 ns 1571458 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 126807 ns 137241 ns 0.92
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5000 ns 3979.5 ns 1.26
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4833.5 ns 4916 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6333 ns 6167 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5875 ns 5562.5 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 145367 ns 148854.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5734563 ns 5416916 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 444458 ns 433541 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 62777 ns 69351 ns 0.91
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8562.5 ns 8958 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8791 ns 8584 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8791.5 ns 9375 ns 0.94
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8209 ns 8208 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 879051 ns 901778 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 39241108 ns 39101068.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5552875 ns 5296271 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 374441 ns 390164 ns 0.96
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 58292 ns 56792 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 59125 ns 57792 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 58917 ns 57667 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 60042 ns 58625 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 43183 ns 38676 ns 1.12
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1223886 ns 1256024 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 430916.5 ns 328000 ns 1.31
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 207588 ns 204982 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 451104 ns 454396 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 468729.5 ns 464875 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 466667 ns 465042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 435917 ns 433750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 266055.5 ns 274516.5 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27322606 ns 27766998 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8068750 ns 7963542 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 777504 ns 840618 ns 0.92
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3310708 ns 3290875 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2333646 ns 2340916.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2333854 ns 2344208.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6332209 ns 6314083.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 204884.5 ns 205766 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 460746.5 ns 213542 ns 2.16
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11462875 ns 11352771 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8327520.5 ns 8308208 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8340084 ns 8331229.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21238333.5 ns 21159458.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 730082 ns 735602 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 2032366 ns 1058910.5 ns 1.92
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5833.5 ns 3542 ns 1.65
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5917 ns 6646 ns 0.89
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7375 ns 7333 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6750 ns 6875 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 138042.5 ns 141882 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5746357.5 ns 5384644 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 742000 ns 792000 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 62888 ns 56381 ns 1.12
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 9458 ns 0.81
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7458 ns 7583.5 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7666 ns 7250 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7791.5 ns 7458 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 759439.5 ns 774451.5 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 37720178 ns 37102116 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5234834 ns 5116062.5 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 366936 ns 368734 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 114708.5 ns 95500 ns 1.20
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 119667 ns 95041 ns 1.26
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 121250 ns 101334 ns 1.20
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 116791.5 ns 96958 ns 1.20
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 149606 ns 153183 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5876370 ns 5925151 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2026958 ns 2007167 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 244046.5 ns 218112 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1934334 ns 2021874.5 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2021750 ns 2010334 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2017250 ns 2025458 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026313 ns 2005917 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 716402 ns 723141 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32736584 ns 33170321 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10561729 ns 10803562.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 973261 ns 1255352 ns 0.78
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 33458 ns 29750 ns 1.12
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 37667 ns 36291.5 ns 1.04
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 35333 ns 35000 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 625 ns 708 ns 0.88
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15397 ns 15831 ns 0.97
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 71093 ns 80041 ns 0.89
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 3770.5 ns 3417 ns 1.10
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3791 ns 3000 ns 1.26
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3916.5 ns 2958 ns 1.32
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 3084 ns 2292 ns 1.35
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 143671.5 ns 144997 ns 0.99
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 305020.5 ns 345563 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8583 ns 7167 ns 1.20
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 7292 ns 6208 ns 1.17
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 7459 ns 6042 ns 1.23
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 11250 ns 10458 ns 1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 42708 ns 37804.5 ns 1.13
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1188962 ns 1127358 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 350459 ns 324750 ns 1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 58610 ns 48830 ns 1.20
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215271 ns 213833 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 222250 ns 221229 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222770.5 ns 220667 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 208791.5 ns 206167 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 253600 ns 251783 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27159107 ns 25462835 ns 1.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7909354 ns 7855917 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 541753 ns 579016 ns 0.94
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 6417 ns 3917 ns 1.64
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 6375 ns 3958 ns 1.61
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 6375 ns 3917 ns 1.63
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 6167 ns 4167 ns 1.48
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 29010 ns 22588 ns 1.28
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2194926 ns 2083671 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 248395.5 ns 226542 ns 1.10
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 41658 ns 42771 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 23917 ns 14916 ns 1.60
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 24083 ns 15083 ns 1.60
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 24250 ns 14916 ns 1.63
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 23583 ns 14792 ns 1.59
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 343876 ns 316521 ns 1.09
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11696705 ns 11265875.5 ns 1.04
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 1015292 ns 963479.5 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 205739.5 ns 193022 ns 1.07
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 130417 ns 101709 ns 1.28
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 122291 ns 99958 ns 1.22
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 125625 ns 106041 ns 1.18
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 120145.5 ns 102208 ns 1.18
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 148950 ns 142614 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5929635 ns 5689078 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2027208.5 ns 2045292 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 277129 ns 214192 ns 1.29
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1923458.5 ns 1924667 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1906812.5 ns 1842979 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1677187 ns 1918292 ns 0.87
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1889625 ns 1901125 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 697015 ns 707209 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31755954.5 ns 31631954.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10391833.5 ns 10461667 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 945188 ns 1220282 ns 0.77
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18625 ns 16604 ns 1.12
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18958 ns 18813 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21500 ns 21271 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18291 ns 18291 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 109595.5 ns 111618 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3368268 ns 3369345 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1330250 ns 464208 ns 2.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 92573 ns 80435.5 ns 1.15
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 253416.5 ns 216042 ns 1.17
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 248958.5 ns 217458 ns 1.14
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217250 ns 216708.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 224708 ns 216395.5 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 523249 ns 534644 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19863537 ns 19551285.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6108125 ns 6104084 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 441706 ns 481515 ns 0.92
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 24770.5 ns 23416.5 ns 1.06
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 31416.5 ns 30395.5 ns 1.03
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 29458 ns 28583 ns 1.03
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1167 ns 1250 ns 0.93
batchedmm(16, Bsize=4)/forward/GPU/CUDA 15968.5 ns 16607 ns 0.96
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 72855 ns 81651 ns 0.89
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 6083 ns 4729.5 ns 1.29
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5645.5 ns 4916.5 ns 1.15
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 6229 ns 5104.5 ns 1.22
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 5750 ns 4875 ns 1.18
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 213748.5 ns 212757 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 331398 ns 378384 ns 0.88
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 305458.5 ns 303792 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 306500 ns 306416.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 308437.5 ns 308125 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 307958 ns 306917 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 230416.5 ns 235352.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7810401.5 ns 7753901 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 910291 ns 895000 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 265765 ns 273893 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 532083.5 ns 532500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 590291 ns 561375 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 543916 ns 533875 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 533520.5 ns 538042 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1084806 ns 1115910 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 46721471 ns 43545460 ns 1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6198229 ns 5736646 ns 1.08
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 811042 ns 855458 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 39270.5 ns 18500 ns 2.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 39125 ns 23125 ns 1.69
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 41125 ns 20875 ns 1.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 38000 ns 20250 ns 1.88
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 130774 ns 117298.5 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3722419 ns 3644245 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1485167 ns 475438 ns 3.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 97071 ns 79291 ns 1.22
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215563 ns 213125 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228187.5 ns 227959 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216125 ns 214479.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214104.5 ns 212750 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 750448 ns 769273 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25517860 ns 26817998 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7039083 ns 7163750 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 494376.5 ns 536785 ns 0.92
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6437.5 ns 5292 ns 1.22
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7041 ns 6979 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7729 ns 8458.5 ns 0.91
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 7667 ns 6958 ns 1.10
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 140363.5 ns 144689.5 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5733917 ns 5674338 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 775625 ns 763958 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 64415.5 ns 65951 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11146 ns 9833 ns 1.13
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10333.5 ns 10395.5 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10625 ns 9875 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10334 ns 10166 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 826886 ns 843305.5 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 38205214 ns 40229475 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5258333 ns 5021354 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 381612 ns 388453.5 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5812.5 ns 5083 ns 1.14
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4896 ns 5645.5 ns 0.87
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7146 ns 7354 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6999.5 ns 7459 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 144336 ns 148525.5 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5913376 ns 5807141 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 776229 ns 768729 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 63198 ns 67441 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7917 ns 7459 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7834 ns 7750 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8084 ns 7583 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7583.5 ns 7291 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 789077 ns 806597 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 39034761 ns 38873703 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5581917 ns 5499042 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 386691 ns 394693 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14474084 ns 14393541 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10170208 ns 10086042 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10149541 ns 10132625 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27806083 ns 27847083 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 528374 ns 531501 ns 0.99
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 882024 ns 400094 ns 2.20
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46387208 ns 45837667 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33540770.5 ns 33412125 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33424500 ns 33550792 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85835875 ns 85694750 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2614141 ns 2655274 ns 0.98
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 4371261 ns 3296132 ns 1.33
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 86729 ns 65750 ns 1.32
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 85375 ns 69354 ns 1.23
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 90167 ns 68834 ns 1.31
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 86875 ns 67708 ns 1.28
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 134910 ns 125224.5 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3737313 ns 3321446 ns 1.13
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1480854 ns 478792 ns 3.09
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 249264 ns 228082 ns 1.09
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 443229.5 ns 442083 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 443333 ns 452104 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 443291.5 ns 442208 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 446979 ns 444791 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 743129 ns 744155 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26210474.5 ns 26781484 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7557333 ns 7548250 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 712153 ns 785568 ns 0.91
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 500 ns 4.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 1916 ns 584 ns 3.28
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2000 ns 583 ns 3.43
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 1833 ns 541 ns 3.39
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 37906 ns 33459 ns 1.13
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1219006 ns 1181669 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 295145.5 ns 266750 ns 1.11
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 53069 ns 47690 ns 1.11
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10708 ns 9104.5 ns 1.18
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11208 ns 8958 ns 1.25
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11354.5 ns 9375 ns 1.21
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11354 ns 8333 ns 1.36
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 285548.5 ns 292729 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 22239786.5 ns 21877451 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5375687 ns 4421083 ns 1.22
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 378846 ns 376084 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9875 ns 9834 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9833 ns 9792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9834 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9834 ns 9834 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23103 ns 23819 ns 0.97
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2142525 ns 1943243 ns 1.10
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 221500 ns 211083 ns 1.05
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 204166 ns 209072 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 50334 ns 45958 ns 1.10
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 50292 ns 46375 ns 1.08
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 50625 ns 46167 ns 1.10
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 49459 ns 45542 ns 1.09
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 308895 ns 297740 ns 1.04
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 13063091 ns 13019378 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 983146 ns 1008520.5 ns 0.97
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 497502.5 ns 610991 ns 0.81
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 63375 ns 56250 ns 1.13
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 64208 ns 57125 ns 1.12
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 62750 ns 57125 ns 1.10
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 64167 ns 57708.5 ns 1.11
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 38397 ns 29558.5 ns 1.30
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1358037 ns 1212552 ns 1.12
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 623167 ns 345084 ns 1.81
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 219088 ns 204882 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 457000 ns 449291.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 475625 ns 482958 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 473458 ns 465791 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 454667 ns 434625 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 261525 ns 253081.5 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 36262740 ns 31946764 ns 1.14
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9115270.5 ns 9299875.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 818947.5 ns 887358.5 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 636625 ns 639500 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 656458.5 ns 610791 ns 1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 588125 ns 650021 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 576916 ns 613396 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 206500 ns 213054.5 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 9251545 ns 8304459 ns 1.11
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1351250 ns 1377667 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 356344.5 ns 314248 ns 1.13
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2240812.5 ns 2230375 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2226896 ns 2241083 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2193750 ns 2226458 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2249750 ns 2044000 ns 1.10
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 984855.5 ns 1009323.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 53834185.5 ns 48595808 ns 1.11
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 12082667 ns 10250250 ns 1.18
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1278034 ns 1209503 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 38958 ns 18583 ns 2.10
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 40333 ns 21500 ns 1.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 41959 ns 22084 ns 1.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 39500 ns 20333 ns 1.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 128340.5 ns 115629.5 ns 1.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3938890 ns 3530676 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1412687 ns 529396 ns 2.67
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 96189 ns 79871 ns 1.20
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222396.5 ns 219583.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 235062.5 ns 228750 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222312 ns 221395.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 234292 ns 219500 ns 1.07
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 746419 ns 743488 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27274809 ns 26086313.5 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7501229 ns 7436521 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 509240 ns 556135 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7333 ns 500 ns 14.67
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7167 ns 584 ns 12.27
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7292 ns 584 ns 12.49
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7250 ns 500 ns 14.50
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 32772 ns 24005 ns 1.37
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1343594 ns 1194343 ns 1.12
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 375375.5 ns 283521 ns 1.32
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 59561 ns 47860 ns 1.24
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17250.5 ns 9979 ns 1.73
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17146 ns 10542 ns 1.63
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17041.5 ns 9687.5 ns 1.76
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17104.5 ns 9916.5 ns 1.72
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 281216 ns 274665.5 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24180264 ns 25054245 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5904229.5 ns 4901583 ns 1.20
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 396149 ns 403794 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10208.5 ns 7750 ns 1.32
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9041 ns 8541 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10583 ns 9458 ns 1.12
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 10479 ns 10041 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 120981.5 ns 122963.5 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3539882 ns 3342683 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 875625 ns 828959 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 69419 ns 70460 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7958 ns 7583 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7666 ns 7875 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8187.5 ns 7917 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7833 ns 7208 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 510474 ns 521824.5 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17725223 ns 17096205 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4291812.5 ns 3622437.5 ns 1.18
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 315298 ns 323444 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 10437 ns 1375 ns 7.59
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 9125 ns 1708 ns 5.34
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 10250 ns 1875 ns 5.47
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 9437.5 ns 1584 ns 5.96
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 23754 ns 22394 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1177268 ns 1154621 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 302750 ns 310833 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 191221.5 ns 190371.5 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4250 ns 3209 ns 1.32
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4291 ns 3333 ns 1.29
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4437.5 ns 3583 ns 1.24
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4209 ns 3500 ns 1.20
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 226224 ns 224060 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10957018 ns 9920013 ns 1.10
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1683833 ns 1731417 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 537002 ns 581006 ns 0.92
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 155333 ns 145687 ns 1.07
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 135542 ns 128584 ns 1.05
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 139833 ns 129625 ns 1.08
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 235312 ns 226167 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 26871.5 ns 25004 ns 1.07
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1241545 ns 1165561.5 ns 1.07
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 301000 ns 248959 ns 1.21
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 48781 ns 40870 ns 1.19
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 144583 ns 143604 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 123500 ns 130083 ns 0.95
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 111854.5 ns 111208 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 265500.5 ns 251937.5 ns 1.05
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 221031 ns 224391 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10748388 ns 10232573 ns 1.05
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2036542 ns 1955250 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 240869 ns 267492 ns 0.90
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8541 ns 7208 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 7250 ns 6083 ns 1.19
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6875 ns 6000 ns 1.15
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 11417 ns 10458 ns 1.09
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 38877 ns 34049 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1252063 ns 1180224 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 351625 ns 325584 ns 1.08
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 59491 ns 50630 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221625 ns 219688 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 232708.5 ns 237125 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 230250 ns 228500 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215395.5 ns 212875 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 263879 ns 270641 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27911566 ns 29882407 ns 0.93
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8205708 ns 8193250 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 553398 ns 592361 ns 0.93
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 15396 ns 14125 ns 1.09
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15583 ns 15291.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 17000 ns 16792 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 15896 ns 16000 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 139189.5 ns 143262 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5507241.5 ns 5352196.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 783146 ns 756916.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 225460 ns 233592 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23833.5 ns 23895.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23229 ns 24041.5 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23792 ns 23542 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23500 ns 23667 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 871309.5 ns 888831 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 40866393 ns 38279760.5 ns 1.07
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5525209 ns 5301166.5 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 627300 ns 679602 ns 0.92
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 26229.5 ns 8875 ns 2.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 28208 ns 9250 ns 3.05
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 30875 ns 11313 ns 2.73
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 27104.5 ns 9834 ns 2.76
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 140173.5 ns 126441 ns 1.11
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3972092.5 ns 3425975 ns 1.16
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 837333 ns 886021 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 84467 ns 73581 ns 1.15
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15542 ns 14000 ns 1.11
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16000 ns 14166.5 ns 1.13
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15959 ns 14541 ns 1.10
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15812.5 ns 13875 ns 1.14
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 677783 ns 686454 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 21242694.5 ns 21159530.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5255020.5 ns 5057854 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 357266 ns 368623 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9166.5 ns 6833 ns 1.34
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9250 ns 9645.5 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11521 ns 10959 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9437.5 ns 9125 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 124109.5 ns 125289 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3514817 ns 3340336.5 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 901187.5 ns 858667 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 75430.5 ns 73441 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13249.5 ns 12750 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12708 ns 12875 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13333 ns 12959 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12895.5 ns 12584 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 556362.5 ns 568824 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 20572384 ns 20335817 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4649292 ns 4008167 ns 1.16
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 339794 ns 341833 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 30083 ns 26604 ns 1.13
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 33458 ns 35042 ns 0.95
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 31458 ns 31437.5 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1854.5 ns 1958 ns 0.95
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16188 ns 16488 ns 0.98
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 73227 ns 80881 ns 0.91
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 6709 ns 5354 ns 1.25
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 6083 ns 5271 ns 1.15
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 6292 ns 5375 ns 1.17
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 7333 ns 6417 ns 1.14
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 145426 ns 144829.5 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 351952 ns 371354 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6833 ns 250 ns 27.33
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6708 ns 417 ns 16.09
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6958 ns 375 ns 18.55
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6792 ns 334 ns 20.34
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 34946 ns 26201 ns 1.33
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1263721 ns 1213684 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 306583 ns 435084 ns 0.70
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 57747 ns 47131 ns 1.23
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 13916.5 ns 6417 ns 2.17
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 13667 ns 6666 ns 2.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 13917 ns 6708 ns 2.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 14083.5 ns 6541 ns 2.15
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 201275 ns 192082.5 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 23279815 ns 23595307 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5587416.5 ns 4957208 ns 1.13
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 391916 ns 388663.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8750 ns 1917 ns 4.56
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8667 ns 2000 ns 4.33
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8709 ns 2042 ns 4.26
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8708 ns 1959 ns 4.45
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 35887 ns 26999 ns 1.33
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1236622.5 ns 1208214.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 324104.5 ns 281958 ns 1.15
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 221524 ns 206222 ns 1.07
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 23854 ns 16312.5 ns 1.46
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 24291.5 ns 17020.5 ns 1.43
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 24291.5 ns 16562.5 ns 1.47
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 23979 ns 16437.5 ns 1.46
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 289525 ns 281291 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 26815785 ns 25314200 ns 1.06
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6199395.5 ns 5387167 ns 1.15
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 645064 ns 705642 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 156000 ns 148250 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 176458 ns 175104 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 149333 ns 154500 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 148750 ns 148375 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 202436 ns 210020 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7935297 ns 7920169 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1474875 ns 1553375 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 189013 ns 236022 ns 0.80
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1044146 ns 1326125 ns 0.79
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1311542 ns 1317625 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1192416.5 ns 1267583 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1333875.5 ns 1330208 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 917103 ns 941055 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49231222 ns 46042204 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6709937.5 ns 9797270.5 ns 0.68
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1050591 ns 1107606 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24771 ns 23542 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25854.5 ns 25167 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27042 ns 28437.5 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 26104.5 ns 24917 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 236772.5 ns 241297.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7699269.5 ns 7644187.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1068000 ns 558625 ns 1.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 123260 ns 114946.5 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 118500 ns 174646 ns 0.68
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 129021 ns 167916 ns 0.77
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 119729 ns 119708.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 176750 ns 126750 ns 1.39
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1082699 ns 1108737 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45816693 ns 45003191 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6477521 ns 5870834 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 580755 ns 610886 ns 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6792 ns 250 ns 27.17
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6708 ns 417 ns 16.09
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6959 ns 375 ns 18.56
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6792 ns 250 ns 27.17
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 32423 ns 23373.5 ns 1.39
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1227496 ns 1207385.5 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 440125 ns 274541 ns 1.60
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 57898 ns 47321 ns 1.22
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 13979.5 ns 6458 ns 2.16
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 14000 ns 6708 ns 2.09
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 14041.5 ns 6625 ns 2.12
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 14479.5 ns 6521 ns 2.22
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 218337.5 ns 207930.5 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 26179966 ns 24020738 ns 1.09
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6079833 ns 5321979 ns 1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 393670 ns 394454 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7208.5 ns 5125 ns 1.41
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6125 ns 6000 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7167 ns 7375 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6958 ns 5500 ns 1.27
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 145657.5 ns 148415.5 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5681117 ns 5743209.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 482812 ns 438042 ns 1.10
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 233405.5 ns 233753 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10167 ns 9708.5 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10396 ns 10500 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10583.5 ns 10292 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9916.5 ns 10000 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 910519.5 ns 921993 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 45597451 ns 40800221 ns 1.12
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6444395.5 ns 5516833 ns 1.17
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 630427 ns 673881.5 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 667 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 666 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 667 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22751 ns 22961 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2127785.5 ns 2040345 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 224292 ns 205708 ns 1.09
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 213458 ns 207722.5 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 8458 ns 4625 ns 1.83
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 8625 ns 4958 ns 1.74
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 8750 ns 4792 ns 1.83
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 8167 ns 4625 ns 1.77
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 245078 ns 232829.5 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9946027.5 ns 11262701.5 ns 0.88
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1678792 ns 1643083.5 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 466702 ns 580356 ns 0.80
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 26125 ns 8166 ns 3.20
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 25062 ns 8250 ns 3.04
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 28687.5 ns 9458 ns 3.03
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 26312.5 ns 8979.5 ns 2.93
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 138390.5 ns 124075.5 ns 1.12
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3609992 ns 3484097 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 802229 ns 848979 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 82243 ns 73621 ns 1.12
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10542 ns 8396 ns 1.26
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10416.5 ns 8584 ns 1.21
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10625 ns 9084 ns 1.17
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10145.5 ns 8334 ns 1.22
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 601697 ns 601403 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 20934991 ns 21381887.5 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5069229 ns 4049604 ns 1.25
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 341518 ns 345603 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 124062.5 ns 123354 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129667 ns 130833 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 130625 ns 130292 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 182895.5 ns 183083 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46031 ns 46276 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 102792 ns 100861 ns 1.02
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 312959 ns 331291 ns 0.94
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 315270.5 ns 336312.5 ns 0.94
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 315645.5 ns 332416.5 ns 0.95
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 569500 ns 584792 ns 0.97
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 197211.5 ns 195249 ns 1.01
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 437317 ns 504285 ns 0.87
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 400541.5 ns 396500 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 290708 ns 287958 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 290750 ns 288167 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 759167 ns 756292 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 51447 ns 43813 ns 1.17
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1442664 ns 1397680 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 430208 ns 359646 ns 1.20
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 90178 ns 81271 ns 1.11
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1469166 ns 1447584 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1136083 ns 1133917 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1148791.5 ns 1135166.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2371458 ns 2356062 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 288834.5 ns 251976 ns 1.15
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 11609366 ns 10628240 ns 1.09
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1837458.5 ns 1770646 ns 1.04
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 272018 ns 350644 ns 0.78
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 654292 ns 641750 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 656208 ns 660333 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 643771 ns 656625 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 660000 ns 541646 ns 1.22
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 202007.5 ns 206977 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8216678 ns 8394592 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1374958 ns 1331770.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 347449 ns 313564 ns 1.11
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2445250 ns 2445250 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2448125 ns 2456229 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2446833 ns 2446833.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2449583 ns 2483750 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1007847 ns 1018661.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51269988 ns 53769994.5 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10195042 ns 9019125 ns 1.13
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1373375 ns 1436974 ns 0.96
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 33084 ns 28875 ns 1.15
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 36250 ns 36438 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34750 ns 34354 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 854.5 ns 833 ns 1.03
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15337.5 ns 15679 ns 0.98
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 72977 ns 79081 ns 0.92
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 4333 ns 3125 ns 1.39
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 4208 ns 3333 ns 1.26
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 4312.5 ns 3542 ns 1.22
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 4041.5 ns 3042 ns 1.33
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 144791.5 ns 141592 ns 1.02
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 307384 ns 340828.5 ns 0.90
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 415000 ns 404000 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 416625 ns 408458 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 416875 ns 407958 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 429041 ns 420750 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 53909.5 ns 44015 ns 1.22
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1567275.5 ns 1346061 ns 1.16
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1167771 ns 1099750 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 360648.5 ns 240182 ns 1.50
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3815583 ns 3854416 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4001187.5 ns 3977416.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4004792 ns 3995708.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3808625 ns 3786812.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 258808 ns 247915 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 39426137.5 ns 38628061.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11450771 ns 11941666 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1225804 ns 1249207.5 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3958 ns 4000 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33782 ns 34055 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1303530 ns 1242873 ns 1.05
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 181771 ns 160875 ns 1.13
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 36288 ns 38220 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 19875 ns 15625 ns 1.27
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 20333 ns 15958 ns 1.27
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 20250 ns 15958 ns 1.27
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 19375 ns 15625 ns 1.24
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 270893 ns 257530 ns 1.05
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 8770612 ns 8798187 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 878395.5 ns 839395.5 ns 1.05
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 171309.5 ns 167922 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404916 ns 403667 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 294792 ns 295750 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296167 ns 295750 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760459 ns 760166 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113227 ns 113514 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1035873.5 ns 1017055 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 465708.5 ns 326291.5 ns 1.43
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 92718.5 ns 87391 ns 1.06
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1467604 ns 1472208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1163084 ns 1161500 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1164667 ns 1160625 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2387167 ns 2378291 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 261443 ns 245391 ns 1.07
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 12224045 ns 10232371 ns 1.19
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1895167 ns 1858625 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 312063 ns 356813 ns 0.87
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7041 ns 500 ns 14.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7000 ns 583 ns 12.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7000 ns 583 ns 12.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6959 ns 500 ns 13.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 34960 ns 26329.5 ns 1.33
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1143512 ns 1165109.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 304770.5 ns 458750 ns 0.66
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 218463 ns 207592 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 15208.5 ns 7458 ns 2.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 15250 ns 7958 ns 1.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 15125 ns 7833 ns 1.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 15250 ns 7500 ns 2.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 231132 ns 220362.5 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25193745 ns 24956286.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6197875 ns 4949916.5 ns 1.25
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 622512 ns 695677 ns 0.89
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 829416.5 ns 824979 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 619250 ns 619166 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 618624.5 ns 619291 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1546958.5 ns 1521750 ns 1.02
batchedmm(128, Bsize=32)/forward/GPU/CUDA 131374 ns 130530.5 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 214891 ns 228943 ns 0.94
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2617583.5 ns 2673291.5 ns 0.98
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 2007958 ns 2003917 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 2002687.5 ns 2004458 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4949021 ns 4938271 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 260037 ns 246670.5 ns 1.05
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 891295 ns 761778 ns 1.17
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1542 ns 291 ns 5.30
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1541 ns 375 ns 4.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1583 ns 333 ns 4.75
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1459 ns 250 ns 5.84
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 37528 ns 32758 ns 1.15
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1208714 ns 1196400 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 287833 ns 263500 ns 1.09
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 53039 ns 46921 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8250 ns 6542 ns 1.26
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8625 ns 6833 ns 1.26
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 6667 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8604.5 ns 6333 ns 1.36
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 223115.5 ns 229162.5 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22661444 ns 21326390 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5050750 ns 4918333 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 370542 ns 360398.5 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2380792 ns 2389042 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2405875 ns 2375416 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2402999.5 ns 2399208 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2376458 ns 2395167 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 202379 ns 205752 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8175895.5 ns 7986200 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1391083 ns 1428354 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 410186 ns 375378.5 ns 1.09
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4622209 ns 4650833 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4659083 ns 4663624.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4674875 ns 4666416.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4668000 ns 4657125 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 906677.5 ns 922860 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 47558050 ns 50907571 ns 0.93
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6580750 ns 6979416.5 ns 0.94
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1259277 ns 1386483.5 ns 0.91
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 14042 ns 13458.5 ns 1.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 14125 ns 7333 ns 1.93
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 16312.5 ns 7708 ns 2.12
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 14813 ns 6416.5 ns 2.31
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 25922 ns 23918 ns 1.08
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1196182 ns 1244282 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 269583.5 ns 235958 ns 1.14
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 47007 ns 40260 ns 1.17
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 33645.5 ns 46271 ns 0.73
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 45875 ns 63375 ns 0.72
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 34041 ns 52500 ns 0.65
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 57042 ns 33708.5 ns 1.69
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 221870 ns 220952 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 13439959 ns 10877336.5 ns 1.24
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2109584 ns 1059416 ns 1.99
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 241391 ns 264808 ns 0.91
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 21166 ns 20208.5 ns 1.05
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 27146 ns 25708 ns 1.06
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 24708 ns 24770.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5208.5 ns 5291 ns 0.98
batchedmm(2, Bsize=512)/forward/GPU/CUDA 17003 ns 17145 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 74790 ns 83681 ns 0.89
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 13084 ns 12646 ns 1.03
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 11125 ns 10645.5 ns 1.05
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 11459 ns 10500 ns 1.09
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18937.5 ns 18146 ns 1.04
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 232004 ns 230722.5 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 354352 ns 371984 ns 0.95
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 409042 ns 405208 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 299834 ns 297166 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 300166.5 ns 297541 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 765625 ns 762459 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 55053.5 ns 46892 ns 1.17
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1389435 ns 1423487.5 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 419458.5 ns 335000 ns 1.25
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 99876 ns 88571 ns 1.13
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1481812.5 ns 1475875 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1176417 ns 1169208 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1178375 ns 1166834 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2397000 ns 2378771 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 321915 ns 287503 ns 1.12
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 14142027 ns 12647035 ns 1.12
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2086041 ns 2003291.5 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 277749 ns 380444 ns 0.73
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 435500 ns 432000 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 438833 ns 436541 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 438375 ns 436708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 449792 ns 448208 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 60743 ns 54845 ns 1.11
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1011327 ns 1004553 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1136916 ns 1035833 ns 1.10
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 388225 ns 234772.5 ns 1.65
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3909292 ns 3891459 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4012979.5 ns 4027292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4023646 ns 4026478.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3817500 ns 3684083 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 264711 ns 268195 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31210175 ns 32271096.5 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10463083 ns 10269354.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1219017 ns 1382008.5 ns 0.88
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 11333 ns 8750 ns 1.30
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 10209 ns 7667 ns 1.33
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 10292 ns 7667 ns 1.34
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 14708 ns 12417 ns 1.18
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 31120 ns 24204 ns 1.29
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2299766 ns 2100905 ns 1.09
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 223896 ns 211416 ns 1.06
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 210884 ns 209352 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 53250 ns 45042 ns 1.18
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 53500 ns 45791 ns 1.17
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 53709 ns 45208 ns 1.19
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 53209 ns 44959 ns 1.18
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 373214 ns 348332 ns 1.07
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 13401439 ns 12300844.5 ns 1.09
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1767583.5 ns 1700187.5 ns 1.04
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 523769 ns 655376 ns 0.80
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 124500 ns 121916.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 85958 ns 144917 ns 0.59
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 88166 ns 88625 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 150334 ns 105229.5 ns 1.43
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189707 ns 189408.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6262796.5 ns 5999999 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1963583 ns 1936000 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 222936 ns 220412 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1958209 ns 2017208 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2014500 ns 2018750 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2023209 ns 2014000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2028167 ns 2017500 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 539409 ns 544732 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27916416 ns 27836425 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9213437.5 ns 9082333.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 909308 ns 961460 ns 0.95

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal merged commit 897d842 into main Sep 10, 2024
59 of 72 checks passed
@avik-pal avik-pal deleted the ap/testing_dense branch September 10, 2024 21:05