This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

fix!: remove deprecations for 1.0 release #82

Merged: 7 commits, Aug 30, 2024

@avik-pal (Member) commented Jul 10, 2024

codecov bot commented Jul 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.41%. Comparing base (8dc51b0) to head (e47e8ba).
Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
- Coverage   83.68%   80.41%   -3.28%     
==========================================
  Files          38       38              
  Lines        1900     1899       -1     
==========================================
- Hits         1590     1527      -63     
- Misses        310      372      +62     
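
As a rough check on how the percentages in the diff above relate to the raw counts, here is a minimal sketch that assumes coverage is simply hits divided by total lines. The helper name `coverage_percent` is hypothetical and is not part of Codecov; Codecov's own rounding may differ slightly.

```python
# Hypothetical helper (not a Codecov API): coverage as hits / lines, in percent.
def coverage_percent(hits: int, lines: int) -> float:
    return round(100.0 * hits / lines, 2)

# Counts copied from the coverage diff: base is main (8dc51b0), head is #82 (e47e8ba).
base = coverage_percent(hits=1590, lines=1900)
head = coverage_percent(hits=1527, lines=1899)

# In both columns, hits plus misses add up to the line count.
assert 1590 + 310 == 1900
assert 1527 + 372 == 1899

print(f"base: {base:.2f}%  head: {head:.2f}%")
```

Both values reproduce the reported 83.68% and 80.41%.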

@avik-pal force-pushed the ap/1.0 branch 7 times, most recently from aeaaaf9 to dba7835 (July 27, 2024 23:12)
@avik-pal force-pushed the ap/1.0 branch 2 times, most recently from ae5f2ad to a156e06 (August 2, 2024 05:40)
@avik-pal force-pushed the ap/1.0 branch 4 times, most recently from 19ae927 to 259549d (August 18, 2024 15:34)
@github-actions bot (Contributor) left a comment

LuxLib Benchmarks

Benchmark suite, Current (e47e8ba), Previous (8dc51b0), Ratio (Current / Previous; below 1 means the current commit is faster)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5583 ns 5833 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5958 ns 6209 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7209 ns 6500 ns 1.11
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6708 ns 6333 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 117750 ns 118732 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2860850 ns 2968100 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 3361583 ns 730042 ns 4.60
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 421144 ns 417444 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9916.5 ns 9834 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9833 ns 9937.5 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9917 ns 10083 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9625 ns 10083 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 553140 ns 577266 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 18595297 ns 19534378 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2382917 ns 2672542 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 696305 ns 679157 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1625 ns 1583 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1688 ns 1875 ns 0.90
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 2958.5 ns 1666 ns 1.78
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1437.5 ns 1583.5 ns 0.91
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21723.5 ns 21231 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1340515 ns 1454941.5 ns 0.92
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 208292 ns 209312 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 37181 ns 30810.5 ns 1.21
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3750 ns 4125 ns 0.91
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4167 ns 4083 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4291.5 ns 4375 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4459 ns 4083 ns 1.09
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 145687 ns 141204.5 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 8279576 ns 8535587 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1490500 ns 1628312.5 ns 0.92
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 148211.5 ns 151661 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57667 ns 57875 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39750 ns 40125 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39958 ns 39792 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83083 ns 82833 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37422.5 ns 36293 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 578922.5 ns 561260.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1029729.5 ns 992500 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78625.5 ns 82050.5 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2019625 ns 2036834 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2085458 ns 2075792 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2085375 ns 2052042 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2000666 ns 1989479.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 231656 ns 223552.5 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 7765871 ns 8096655 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7650583 ns 7643042 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1504421 ns 1110381 ns 1.35
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 149083.5 ns 145625 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 146917 ns 154708.5 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 150000 ns 174688 ns 0.86
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147250 ns 154145.5 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165605 ns 165157 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7579764 ns 7006708 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1671333 ns 1598583 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 185332 ns 185621 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1120208 ns 1111542 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1112249.5 ns 1113792 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1119979 ns 1117416 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1115458.5 ns 1116375 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 697776 ns 667136.5 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31396155 ns 33531181.5 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6206875 ns 6722500 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1041688 ns 916229 ns 1.14
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5125 ns 4208.5 ns 1.22
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4750 ns 5083 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5583 ns 4875 ns 1.15
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4208 ns 4375 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 93533.5 ns 88783 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5284344 ns 5683274 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 465584 ns 465729 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 59600 ns 71591 ns 0.83
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8625 ns 8708 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8875 ns 8625 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8833 ns 8625 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8459 ns 9042 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 598346 ns 582943 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 33477623 ns 37889755 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6013458.5 ns 5975583.5 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 390743 ns 389426 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17354.5 ns 18250.5 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17520.5 ns 18833.5 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18916 ns 18541 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18708.5 ns 18021 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 67316 ns 65828.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 2841743 ns 2797898 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1301187.5 ns 1292083.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 75270.5 ns 77641 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 211750 ns 212959 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221250 ns 212416 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212979.5 ns 223375 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220959 ns 219958 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 355350 ns 345170 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 12887660 ns 13000126.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5578875 ns 5618187 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 475568.5 ns 472507 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 583.5 ns 625 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 625 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 792 ns 750 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 709 ns 708 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20658 ns 20278 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1117791 ns 1174845 ns 0.95
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 293750 ns 284041.5 ns 1.03
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 32570 ns 34141 ns 0.95
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1417 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1437.5 ns 1375 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1500 ns 1458 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1334 ns 1375 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 125407.5 ns 122996.5 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8435986 ns 8936472 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1520937 ns 1545542 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 124981 ns 128652 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7292 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5334 ns 5416 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5458 ns 5334 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10292 ns 10125 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24335 ns 23494.5 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1244509 ns 1206688.5 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 613771 ns 352291.5 ns 1.74
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 46950 ns 48921 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221791 ns 265208 ns 0.84
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 263562.5 ns 228583 ns 1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 267459 ns 268375 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 258042 ns 220208 ns 1.17
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 191390.5 ns 191406 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31212923 ns 34275082 ns 0.91
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9028021 ns 9545416 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 615105 ns 615580 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4125 ns 4084 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4125 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4166 ns 4125 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4084 ns 4084 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23747 ns 23388 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 2059889 ns 1884551 ns 1.09
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 224375 ns 222625 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 48710.5 ns 50581 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16417 ns 16500 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16666 ns 16541 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17166.5 ns 16666 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16500 ns 16500 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 196190.5 ns 191032 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 10575444.5 ns 9654050 ns 1.10
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 1220604 ns 1315416 ns 0.93
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 178941 ns 179153 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 511375 ns 511083 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 332250 ns 332542 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 331958 ns 332750 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 865541 ns 865000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113960 ns 113564 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 396373 ns 397782 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 455458 ns 399542 ns 1.14
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 247962 ns 249264 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2265542 ns 2268937 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1741145.5 ns 1755645.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1750125 ns 1746583 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3194667 ns 3196292 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 240998 ns 236643 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 12033885 ns 9269331 ns 1.30
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1913833 ns 1892000 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 763086 ns 761836.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6625 ns 6167 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6104 ns 6250 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7709 ns 7875 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6709 ns 6292 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 90571.5 ns 90951 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5303527.5 ns 5183601.5 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 773833.5 ns 790084 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 60371 ns 60171 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11104.5 ns 9729.5 ns 1.14
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11541.5 ns 11833.5 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11583.5 ns 10709 ns 1.08
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11041 ns 11250 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 621523 ns 631820 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38208156 ns 38968720 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5786083 ns 5635041.5 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 413623 ns 413756 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 541 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 541 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 541 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 541 ns 541 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23874 ns 22959 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2259468 ns 2250193 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 228959 ns 229979.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 51460 ns 51060 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2084 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2209 ns 2083 ns 1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2083 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 218615 ns 238043 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 10999156 ns 12339690 ns 0.89
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 1993375 ns 1997542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 179811 ns 176033 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9375 ns 8458 ns 1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9709 ns 8604.5 ns 1.13
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11667 ns 10250 ns 1.14
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9083 ns 8458 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 107558 ns 111812 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3162682 ns 2954218 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 851875 ns 809875 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 77041 ns 75421 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17958.5 ns 17729.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 19042 ns 17854 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18833 ns 18479.5 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17541.5 ns 17500 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 597196.5 ns 612415.5 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 17134824 ns 16447833 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5474187 ns 5303292 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 387393 ns 386655 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 459 ns 1.27
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 35659 ns 35148 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1185739.5 ns 1185387 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 293083 ns 379167 ns 0.77
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 47871 ns 45811 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9583.5 ns 8625.5 ns 1.11
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9333.5 ns 9625 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9270.5 ns 9833 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8937.5 ns 8979.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 259605 ns 266322.5 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18447546 ns 19024975 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5011104 ns 5023625 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 374128 ns 376345 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398875 ns 398458 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215584 ns 215375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 215083 ns 215625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 755958 ns 756084 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111970 ns 110416.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 332768 ns 325801 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 386416 ns 380603.5 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 79430 ns 78551 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1388333 ns 1395208.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 857833 ns 859166.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 858042 ns 860417 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2356750 ns 2356542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 207644 ns 203387 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 8781675 ns 10253444.5 ns 0.86
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1598729 ns 1668583 ns 0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 322992.5 ns 324309.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7312.5 ns 7521 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7709 ns 7208 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8375 ns 7937.5 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7125 ns 7354.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 143151 ns 146147.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 6385139 ns 5499314 ns 1.16
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 448750 ns 448604 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 61490 ns 60691 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15020.5 ns 14937.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14291 ns 13604.5 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14625 ns 13667 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15854 ns 15375.5 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 972405 ns 955436.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 49449537.5 ns 43131702 ns 1.15
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5975895.5 ns 5899125.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 437174 ns 433397 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25083 ns 24125 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25125 ns 24708.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28500 ns 28229 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25687.5 ns 24895.5 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 201048 ns 196723 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8216742 ns 7737736 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1213500 ns 1117208 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 118471 ns 117742 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 145292 ns 103770.5 ns 1.40
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 147625 ns 117375.5 ns 1.26
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 113812.5 ns 147541.5 ns 0.77
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 148562.5 ns 159541 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1082636 ns 1058384 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42070250 ns 44485069 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5751979 ns 5929750 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 601145 ns 590519 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74542 ns 75041 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76959 ns 75021 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 80042 ns 76729.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74000 ns 85708 ns 0.86
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 210048 ns 203053 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7715013 ns 7420031.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 543000 ns 532041.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 125206 ns 125262 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 301875 ns 274937.5 ns 1.10
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 284709 ns 306333 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 211791 ns 314500 ns 0.67
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 299770.5 ns 291333 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1128416 ns 1113767.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43276791 ns 41752889.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6601000 ns 6339625 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 702246 ns 696159 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16958 ns 16375 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 17500 ns 17166 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 18667 ns 17708.5 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 16292 ns 16500 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 148481 ns 149324.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5705815.5 ns 5632259 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 524625 ns 451041 ns 1.16
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 239632 ns 238583.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27313 ns 25041.5 ns 1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 28000 ns 27458.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26959 ns 27208 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 25250.5 ns 27417 ns 0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 984571.5 ns 967445 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 39101861 ns 42171811 ns 0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6098270.5 ns 5985271 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 714026 ns 714285 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11084 ns 10541 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11417 ns 10708 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13500 ns 12167 ns 1.11
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11083 ns 10416 ns 1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 126073 ns 124817.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3886871.5 ns 3419436 ns 1.14
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 831084 ns 811375 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 239702 ns 240213 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21542 ns 21834 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 22083 ns 21917 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21958 ns 22667 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21667 ns 22875 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 706566 ns 693267 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 20349369 ns 20616607 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5568417 ns 5554812 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 687945 ns 675879 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 62437.5 ns 63875 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 63291.5 ns 65458 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 66458 ns 68750 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 62750 ns 63291 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107293.5 ns 106862 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3660159 ns 3236758 ns 1.13
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1323062.5 ns 1339728.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 239692 ns 237523 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 490792 ns 436417 ns 1.12
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 443541 ns 449729 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 450500 ns 447750 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 437917 ns 486125 ns 0.90
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 517439 ns 515853 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 21639679 ns 20960752 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6076541.5 ns 6146771 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 731271.5 ns 715734 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7208.5 ns 7250.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7521 ns 7041.5 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9042 ns 8708 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7125 ns 6916.5 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 146603.5 ns 146046 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 6392989 ns 5957852 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 458687.5 ns 454417 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 59111 ns 59271 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14875 ns 14771 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15062.5 ns 15479 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14937.5 ns 15062 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15125 ns 14000 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 952699.5 ns 942484.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 40798622 ns 39118999 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5781813 ns 5667729 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 405764 ns 407846 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6157292 ns 6158125.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 3225250 ns 3218166 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 3226625 ns 3227708 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11915500 ns 11925375 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 350478 ns 351461 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 296627.5 ns 299264 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19132270.5 ns 19150312.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 11022312 ns 11075104 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 11088416 ns 11106625 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36416791.5 ns 36514875 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1067365 ns 1053961 ns 1.01
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1157365 ns 1154031 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 917 ns 958 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1042 ns 1000 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1000 ns 1041 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 959 ns 958 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23582 ns 23131 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2239100 ns 2063993 ns 1.08
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 288125 ns 232479 ns 1.24
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 214282 ns 213903 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3667 ns 3708 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3750 ns 3667 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3709 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3625 ns 3667 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 282213 ns 280249 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11258687 ns 11110622 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2144000 ns 2136458 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 648245 ns 645129 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8625 ns 8250 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8875 ns 7791.5 ns 1.14
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9667 ns 9125 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8667 ns 8396 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 122810 ns 121638.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3572398.5 ns 3248289.5 ns 1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 748125 ns 788916 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 69920 ns 67611 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11875 ns 11729.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 13208.5 ns 12271 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 13292 ns 13459 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11791.5 ns 11770.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 647865 ns 639448 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22137350 ns 20615290 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 4443083 ns 5086271 ns 0.87
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 365428 ns 366630 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 291 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22406 ns 22523 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2171056 ns 2092333 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 227500 ns 223666.5 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 53190 ns 52621 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2958 ns 2875 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3042 ns 2959 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3416 ns 3042 ns 1.12
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2875 ns 2875 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 204810 ns 203283 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 10200655 ns 9008155 ns 1.13
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1731833 ns 1643667 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 161696.5 ns 171352 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11833 ns 10209 ns 1.16
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11792 ns 11875 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13333 ns 13000 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11000.5 ns 11291 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 123504 ns 122118 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3318538 ns 3370469 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 955979 ns 932041 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 238252 ns 239973.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23083.5 ns 20833 ns 1.11
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 22292 ns 20771 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21917 ns 21541.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21417 ns 22729 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 601694.5 ns 592817 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21113099.5 ns 20103668 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4708208 ns 4792708 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 667256 ns 667099 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4417 ns 4416 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4417 ns 4417 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4417 ns 4458 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4416 ns 4417 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24698 ns 24053 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2115886.5 ns 2139501 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 222604 ns 223416 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 54070 ns 54331 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16167 ns 16292 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16417 ns 16375 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16459 ns 16375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16229.5 ns 16312.5 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 332326 ns 328788 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12612468 ns 12357389.5 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1596583 ns 1610333 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 216202 ns 214938 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2084 ns 2042 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2166 ns 2042 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2083 ns 2208 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2042 ns 2000 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 36354 ns 36532 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1200010 ns 1144768 ns 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 315833 ns 338417 ns 0.93
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 208602 ns 206372 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 17583 ns 17708.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 17687.5 ns 19145.5 ns 0.92
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 17500 ns 18687.5 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 17375 ns 17583.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 295580 ns 294488 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20027287 ns 21056777.5 ns 0.95
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5448166 ns 4806541.5 ns 1.13
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 694207 ns 704000 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 58875 ns 61291.5 ns 0.96
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 60625 ns 60708 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 61083 ns 61791 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51792 ns 51625 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66673 ns 66466 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 94791 ns 97471 ns 0.97
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 159167 ns 193333 ns 0.82
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 144395.5 ns 132604 ns 1.09
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 135645.5 ns 153021 ns 0.89
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 218917 ns 255166.5 ns 0.86
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 218558 ns 218241 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 584845 ns 583953 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 85042 ns 83208 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82792 ns 82958 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 84000 ns 87041.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80708 ns 86458 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190714 ns 191093 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5317512.5 ns 5412302 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2004833 ns 1964604.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 171982 ns 170373 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1918646 ns 1871250 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1920146.5 ns 1923625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1922500 ns 1926625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1926417 ns 1695083 ns 1.14
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 537463 ns 533673 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 25759235 ns 27973144 ns 0.92
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8892541.5 ns 8716646 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1086039.5 ns 1083959 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21729 ns 21925 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2156001.5 ns 2103570 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 340438 ns 323625 ns 1.05
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 46710 ns 45100 ns 1.04
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1833 ns 1791 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 254431 ns 253156 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9970916 ns 9676564 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1487229 ns 1486062.5 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 187061 ns 183853 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8583 ns 9250 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 10709 ns 8791 ns 1.22
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11104 ns 11541.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8625 ns 9833 ns 0.88
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 120562 ns 119759.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3392235 ns 3304871 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 896875 ns 911583 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 237707 ns 242563 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10541 ns 9084 ns 1.16
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10708 ns 9083 ns 1.18
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10145.5 ns 9875 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9583 ns 10542 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 533774 ns 528445 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21000394 ns 20929851 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4276334 ns 4465959 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 627816 ns 649598 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58000 ns 58187.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39375 ns 39583 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39645.5 ns 39833 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83709 ns 83167 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40383 ns 39718.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1402434 ns 1335928 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1146125 ns 1144792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78591 ns 76941 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1924459 ns 1876458.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1974917 ns 1982000 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1974604 ns 1975334 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1871479 ns 1876084 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 223824 ns 223366 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33746868 ns 33121559 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11596333 ns 11069000 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1032529 ns 1033133.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 418208 ns 419792 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 419792 ns 419416 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 425479 ns 420417 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 424000 ns 417833 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 214170 ns 209830.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7394482 ns 7621895 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 544334 ns 539709 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 285852 ns 287624 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 760250 ns 670083 ns 1.13
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 674416 ns 762791.5 ns 0.88
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 735937.5 ns 739541 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 698562.5 ns 764667 ns 0.91
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1053639 ns 1045546 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44748386 ns 42506282 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6735604 ns 6380125 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 919018.5 ns 921656 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3464521 ns 3366854.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3422854 ns 3432979 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3395334 ns 3458292 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3468645.5 ns 3357375 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 175338 ns 176639 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8033050 ns 8129736 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1413708.5 ns 1393270.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 428514 ns 423500.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6215854.5 ns 6223146 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6089541.5 ns 6217459 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6227000 ns 6240625 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6184187.5 ns 6221312.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1007413 ns 997179 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51999371 ns 50529292.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7858354.5 ns 8164709 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1564094 ns 1566429.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 474958 ns 473000 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 253500 ns 254042 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 253250 ns 254542 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 901666 ns 902333 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46720 ns 46242.5 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 389374 ns 825428 ns 0.47
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 425291 ns 517333 ns 0.82
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 250522 ns 250313 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2250542 ns 2279166.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1761750 ns 1761750 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1761166 ns 1764396 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3198959 ns 3193125 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 269668 ns 268875.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 8958333 ns 13207390 ns 0.68
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2163500 ns 2166292 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 786867 ns 784110 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57792 ns 57375 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39333 ns 39292 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39791 ns 39541 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83417 ns 83667 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28664 ns 28000 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 978884 ns 1420961.5 ns 0.69
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1153500 ns 1133895.5 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75511 ns 78041 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2028042 ns 1783500 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2056020.5 ns 2087458 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2086000 ns 2091417 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1945229 ns 1973375 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 236349.5 ns 235065 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 38351191 ns 34323841 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11477729 ns 11467646 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1054329.5 ns 1053243 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58541 ns 57500 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39875 ns 39791 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 40167 ns 39875 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82792 ns 83333 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 50484.5 ns 49753 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 809837 ns 807009.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1111291 ns 1110750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78201 ns 71821 ns 1.09
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1881500 ns 1870083 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1941229.5 ns 1974791.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1971250 ns 1975458.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1896833 ns 1719417 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 242923.5 ns 242025 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 18110920.5 ns 17950511 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9855750 ns 9840104.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 930267 ns 928181 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 35373 ns 35044 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1269712 ns 1224421 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 440854.5 ns 279916 ns 1.57
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 47780 ns 50520 ns 0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6791.5 ns 6083 ns 1.12
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6667 ns 7041 ns 0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6959 ns 7542 ns 0.92
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6958 ns 6583 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 215041.5 ns 212138.5 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20712502.5 ns 20858604.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4803791.5 ns 4933020.5 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 375493 ns 377125 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32582 ns 32102 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1261691 ns 1246143 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 254541.5 ns 252500 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 43691 ns 40121 ns 1.09
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2958 ns 3250 ns 0.91
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3167 ns 2833 ns 1.12
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2958 ns 3417 ns 0.87
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2834 ns 3166 ns 0.90
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 189747.5 ns 187793.5 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 8052543 ns 7423467 ns 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 938459 ns 930666 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 162191.5 ns 159252 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 450145.5 ns 426395.5 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 446959 ns 423458 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 429042 ns 453437.5 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 422229 ns 422541.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 138250 ns 138012 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6236248.5 ns 6078596 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2128729 ns 2105875 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 373698.5 ns 351154 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3793417 ns 3627187.5 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3811000 ns 3781646 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3814875 ns 3818708.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3787042 ns 3816750.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 714820 ns 714220.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33262062 ns 32708263 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10779334 ns 10437208 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1498493.5 ns 1330337 ns 1.13
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49901250 ns 49952500 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 25981417 ns 25992042 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 25983500 ns 25974771 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97079479.5 ns 97060375 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1594678 ns 1609718.5 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1014749 ns 1005437.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154541375 ns 154751187.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 88793000 ns 88411625 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 88530458 ns 89142125 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 294936604.5 ns 295023146 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6471554 ns 6525541 ns 0.99
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5536819 ns 5541499 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 17979 ns 17458.5 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 15459 ns 15417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 13000 ns 13916 ns 0.93
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15146 ns 15187 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 20648 ns 20963 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1156334 ns 1029086 ns 1.12
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 224875 ns 221417 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 27171 ns 27290 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 11125 ns 10625 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 7729 ns 7687.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 7854.5 ns 7895.5 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17250 ns 17333.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 263885.5 ns 262988 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 9825365 ns 11032315 ns 0.89
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1608208 ns 1558750 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 152662 ns 153002 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 9000 ns 7917 ns 1.14
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8771 ns 8333.5 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10396 ns 11125 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8833.5 ns 8250 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 116927 ns 116148 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3541234 ns 3496720 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 800667 ns 797854 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 240932 ns 240663 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9729.5 ns 10021 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10187.5 ns 10083.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10167 ns 10791.5 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9604 ns 10584 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 626219.5 ns 627842 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 26581249 ns 22890536.5 ns 1.16
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5185125.5 ns 4718917 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 668776 ns 670993.5 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10187.5 ns 9271 ns 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9792 ns 9541 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10917 ns 10875 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9292 ns 9270.5 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 124324.5 ns 122880.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3445385.5 ns 3253148 ns 1.06
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 931250 ns 918333 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 72601 ns 73381 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13791 ns 15083 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 15042 ns 14167 ns 1.06
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 16750 ns 17042 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 14875 ns 14667 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 599265 ns 595348 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19837253 ns 19444920 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4467250 ns 4763896 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 354288 ns 353084 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 666 ns 459 ns 1.45
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 459 ns 459 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 35015 ns 35417 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1242520 ns 1186574 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 426916.5 ns 416604 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 208692 ns 209112 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8937.5 ns 8979 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9375 ns 10292 ns 0.91
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9000 ns 10416.5 ns 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8917 ns 8729.5 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 236066 ns 233445.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21882713 ns 21282401 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5349250 ns 5435416.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 656976 ns 676048 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 15792 ns 15708 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 13416 ns 14583 ns 0.92
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 12583.5 ns 12416 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10979 ns 9937 ns 1.10
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 22506 ns 21468 ns 1.05
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1159535 ns 1188974 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 197166 ns 204687 ns 0.96
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 187121.5 ns 182912 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32125 ns 32083 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32125 ns 31979 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32125 ns 32583 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32042 ns 31916 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 281987.5 ns 277811 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11256812 ns 11129104.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1704270.5 ns 1607584 ns 1.06
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 602756 ns 603987 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 439208 ns 443583 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 440312 ns 441395.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 442437.5 ns 443312.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 452250 ns 439937.5 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194353 ns 194190 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6187612 ns 5958005 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2010833.5 ns 1994958 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 371658.5 ns 350285 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3836125 ns 3816917 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3828416.5 ns 3836875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3833375 ns 3840729.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3804958 ns 3801625 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 543630.5 ns 546260 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28316634 ns 28675319 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9576666 ns 9200208 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1217281 ns 1220685 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 781279958 ns 783919458 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 418024250 ns 415090937.5 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 415003958 ns 416149396 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1553302312.5 ns 1556394646 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22534687 ns 22758802.5 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14053357 ns 14026629 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2540355333 ns 2531412125 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1525674250 ns 1503429375 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1510867083 ns 1511972625 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 5211355166 ns 5238183333 ns 0.99
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 372139138 ns 341968825.5 ns 1.09
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88484108 ns 89112141 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76833.5 ns 76084 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 80438 ns 77666 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79375 ns 79333 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 77000 ns 88708 ns 0.87
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 210860 ns 209926 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7939257.5 ns 7624963 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 556833 ns 538459 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 110121 ns 111431 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 194333 ns 193479 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 196250 ns 195396 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 280604.5 ns 255791 ns 1.10
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 209625 ns 263084 ns 0.80
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1044977.5 ns 1056306 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44213920 ns 42921068 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6328458 ns 6096396 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 635831 ns 638587 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 200007250 ns 199996979.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 103851687 ns 104048375 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 103904750 ns 103857041 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 388866500 ns 389154708 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5820988 ns 5838520 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3429802 ns 3416961 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 621011562.5 ns 619738166.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 351243917 ns 352609750 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 354523166 ns 353140208 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1184086167 ns 1179908250 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26473310 ns 26709121 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21855057 ns 21908376.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7250 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5416 ns 5292 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5375 ns 5375 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 9958 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28684 ns 27949 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1197511 ns 1220698 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 675667 ns 445770.5 ns 1.52
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48700 ns 49951 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215000 ns 214667 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221959 ns 222250 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221958.5 ns 222083.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215917 ns 217354.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 219373 ns 226060 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 33465057 ns 31786029 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9200792 ns 9164667 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 538594 ns 535067 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9271 ns 8208.5 ns 1.13
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8708 ns 7375 ns 1.18
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10417 ns 10708 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7562.5 ns 8021 ns 0.94
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 117805.5 ns 118426 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3416067 ns 3289610 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 906500 ns 894959 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 74050 ns 75861 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8500 ns 8458.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9021 ns 8458 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11583 ns 11000 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8875 ns 8625 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 511809 ns 525710 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 18851406 ns 19476117 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4467000 ns 4580750 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 318773 ns 323803.5 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 709 ns 542 ns 1.31
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 709 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 25714 ns 26694 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1273031 ns 1232825 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 450458.5 ns 334666 ns 1.35
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 48810 ns 51101 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16084 ns 12833 ns 1.25
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 12146 ns 11125 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 12125 ns 12708 ns 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 11334 ns 11375 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 250965 ns 255932 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 23303938.5 ns 23574064 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5365646 ns 5957187 ns 0.90
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 389799 ns 393244 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 106291 ns 106916 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 84625 ns 84416 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 86166 ns 85416 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146500 ns 146729 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 24955 ns 24228 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1163867 ns 1231419 ns 0.95
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 262458 ns 259958 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 185231.5 ns 188842 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 479417 ns 478583.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 519521 ns 479354.5 ns 1.08
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 481771 ns 479416 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 504437.5 ns 522125 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 230703 ns 234731 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11688664.5 ns 11445580 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2205416 ns 2175521 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 617466 ns 622217.5 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5875 ns 5208 ns 1.13
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 7333 ns 6958 ns 1.05
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7000 ns 7167 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 6312.5 ns 5020.5 ns 1.26
batchedmm(16, Bsize=32)/forward/GPU/CUDA 15960 ns 17348 ns 0.92
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 79085.5 ns 79231 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 13208 ns 12875 ns 1.03
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10667 ns 12083 ns 0.88
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11167 ns 12667 ns 0.88
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 17083.5 ns 17708.5 ns 0.96
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 211295 ns 217078.5 ns 0.97
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 373923 ns 388595 ns 0.96
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39209 ns 39625 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 50708 ns 50625 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 51083 ns 51000 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13541.5 ns 13666.5 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 21656 ns 20461 ns 1.06
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 80316 ns 83341 ns 0.96
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 37833 ns 38542 ns 0.98
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 32083 ns 29917 ns 1.07
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 30083 ns 31417 ns 0.96
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57250 ns 66000 ns 0.87
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 189684 ns 195656.5 ns 0.97
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 406643.5 ns 398885 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1916.5 ns 1770.5 ns 1.08
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1625 ns 1.15
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2125 ns 2292 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1791.5 ns 1729.5 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 20698 ns 21146 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1151725 ns 1123716.5 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 318416.5 ns 302958 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 30320 ns 28491 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2208.5 ns 2229.5 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2416 ns 0.90
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2375 ns 2375 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2166 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 201882 ns 205300 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 8916839 ns 9074561 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1544750 ns 1516937.5 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 138336.5 ns 138212 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6271 ns 5645.5 ns 1.11
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4854.5 ns 4771 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6312.5 ns 6604 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5042 ns 4979.5 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 143284 ns 147775 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5900971 ns 6128313 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 567291.5 ns 450875 ns 1.26
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 62250 ns 62371 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8958.5 ns 8958 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9250 ns 8750 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9438 ns 9125 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8687.5 ns 9625 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 864087 ns 883717 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 38587534 ns 41518756 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5770625 ns 5658500 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 391113 ns 388034 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56750 ns 56709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 56916 ns 56833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57000 ns 56917 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58250 ns 58292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37169 ns 38043 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1177949 ns 1221995 ns 0.96
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 363395.5 ns 611541 ns 0.59
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 207172 ns 207452 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 451792 ns 450937.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 467500 ns 466917 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 466312.5 ns 468562.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 441500 ns 473167 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 264130.5 ns 271371 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27819058 ns 26618792 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8272125 ns 8082167 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 816138 ns 807824 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3324375 ns 3309813 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1763125 ns 1763625 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 1769958 ns 1772167 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6313291.5 ns 6307500 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 205480 ns 206270.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 205832 ns 211692.5 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11532979 ns 11489208 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 6556937.5 ns 6543312.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 6559812.5 ns 6593875 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21146833 ns 21174666.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 740245 ns 735714 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1073505 ns 1071922.5 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6083 ns 6437 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4917 ns 5125 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6000 ns 7604.5 ns 0.79
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5209 ns 6021 ns 0.87
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 137073 ns 141217 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5763070.5 ns 5736528 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 761979.5 ns 743958 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 58111 ns 58020 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7166 ns 7750 ns 0.92
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 13625 ns 8791 ns 1.55
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7416 ns 7417 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7146 ns 8084 ns 0.88
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 748959 ns 759240 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 37270377 ns 35174267 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5566708.5 ns 5288042 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 379893 ns 379024.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 97625 ns 97583 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 95708 ns 101959 ns 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 97708 ns 127542 ns 0.77
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 122083 ns 96084 ns 1.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 149717.5 ns 153040 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5815106.5 ns 5764876 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2046458 ns 2076375 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 186092 ns 184732 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2032063 ns 1822416 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2035417 ns 2035833.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2025750 ns 2031521 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2034812.5 ns 2029667 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 699402 ns 712381 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33359753 ns 32235030 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10780625 ns 10817667 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1123756 ns 1119068 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 32895.5 ns 32771 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 35958 ns 34958 ns 1.03
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 32292 ns 33834 ns 0.95
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 625 ns 584 ns 1.07
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15283 ns 16070 ns 0.95
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 80500 ns 80701 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2625 ns 2645.5 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3291 ns 4250 ns 0.77
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3042 ns 3083 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2334 ns 2979.5 ns 0.78
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 137331.5 ns 140484 ns 0.98
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 346583 ns 362954 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7166 ns 7250 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 5333 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5416 ns 5375 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 10167 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36591 ns 37558 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1255336 ns 1203117.5 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 361167 ns 351958 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48660 ns 50591 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212187 ns 215229 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221041.5 ns 223042 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221312 ns 221041.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206750 ns 216292 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 242479 ns 247737.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26291133.5 ns 28210154.5 ns 0.93
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8127688 ns 7826917 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 523004.5 ns 518941 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 4000 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3959 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3959 ns 3959 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21550 ns 22280 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2157769 ns 2135337 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 248541 ns 244750 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 45911 ns 45821 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14667 ns 14708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14708 ns 14708 ns 1
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14750 ns 14750 ns 1
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14708 ns 14708 ns 1
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 304620 ns 313766.5 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 10977767 ns 11565919.5 ns 0.95
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 1043625 ns 996417 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 199612 ns 196698 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 106000 ns 102375 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 98291.5 ns 98375 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 102542 ns 130667 ns 0.78
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 128833 ns 101541 ns 1.27
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 135179 ns 142696 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5987975 ns 6012180 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2089312.5 ns 2060042 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 186671 ns 185242 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1921917 ns 1678708 ns 1.14
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1911646 ns 1919562.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1921417 ns 1925646 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1918250 ns 1715750 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 683849 ns 697882 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32200299 ns 32586423 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10813854.5 ns 10270770.5 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1072735 ns 1227914 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17875 ns 20125 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18042 ns 18666 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21499.5 ns 20125 ns 1.07
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18541 ns 19041.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107857 ns 111256 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3405654 ns 3316785.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1342500 ns 1342375 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 81381 ns 77136 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216416.5 ns 216708 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226729 ns 217270.5 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217666.5 ns 217000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 228458.5 ns 257500 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 513449.5 ns 522548.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19313650 ns 19703098 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5992125 ns 6106875 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 473434 ns 495696 ns 0.96
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 23937.5 ns 23625 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 28875 ns 28917 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 26500 ns 27167 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1416 ns 1542 ns 0.92
batchedmm(16, Bsize=4)/forward/GPU/CUDA 15770 ns 16593 ns 0.95
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 82311 ns 83321 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4833 ns 4937.5 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5000 ns 4709 ns 1.06
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5166 ns 5125 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4625 ns 5479 ns 0.84
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 205185 ns 210967 ns 0.97
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 383233 ns 384204.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 307000 ns 304709 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 307333 ns 305417 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 309000.5 ns 307312.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 306959 ns 304999.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 227362 ns 231440.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7656672 ns 7899776.5 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 650375 ns 1048666.5 ns 0.62
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 275572 ns 278713 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 537562 ns 531667 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 532667 ns 537916 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 535625 ns 559833 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 542458 ns 535042 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1070334 ns 1077983 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44223125 ns 46672590 ns 0.95
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6462771 ns 6185542 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 869258 ns 867079 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19500 ns 21000 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19958 ns 19792 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23375 ns 21333.5 ns 1.10
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19958 ns 20125 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 112543.5 ns 115430.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3491900 ns 3543630 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1414625 ns 1426729 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 77381 ns 77991 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213208 ns 212667 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213479 ns 214292 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215042 ns 213916 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214958 ns 219958 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 750498 ns 758463 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 24681158 ns 25339852 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7223895.5 ns 7150812.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 544035 ns 549146 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6417 ns 6666 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6875 ns 7000.5 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8292 ns 8374.5 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6500 ns 6396 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 138363 ns 144368 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5558527 ns 5600145 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 777187 ns 781083 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 69061 ns 69300 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10375 ns 10917 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10291 ns 10041.5 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10708.5 ns 10791 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9709 ns 11250.5 ns 0.86
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 819633 ns 829126 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 39220243 ns 38035335 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5518708 ns 5400125 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 385803 ns 389489 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5958 ns 6333 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4875 ns 5291 ns 0.92
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6917 ns 7042 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4729.5 ns 4562.5 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 142280 ns 146644 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5842056.5 ns 5614464 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 769459 ns 767750 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 59561 ns 60400 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7209 ns 7583 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7666 ns 7750 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7917 ns 7625 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7583 ns 8625 ns 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 778165 ns 788273 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 39496840.5 ns 39532384.5 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5854688 ns 5788792 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 404144 ns 390144 ns 1.04
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14575750 ns 14512959 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 7731500 ns 7746083 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 7698583 ns 7719437.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27811541 ns 27824167 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 535283 ns 532712 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 407233 ns 405110 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46519437 ns 46254125 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 26552479.5 ns 26514813 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 26436334 ns 26596375 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85626334 ns 85595417 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2913979.5 ns 2648732 ns 1.10
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3300841 ns 3291677 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66500 ns 69916 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 66709 ns 66666.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 68312.5 ns 67604 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 67500 ns 69812.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 105648 ns 119643.5 ns 0.88
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3451369.5 ns 3502655.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1470250.5 ns 1447479.5 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 234332 ns 236773 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 440250 ns 480313 ns 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 441125 ns 447125 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 445625 ns 447937.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 442624.5 ns 444459 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 729654 ns 735182 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26852660 ns 27836501.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7754417 ns 7344541.5 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 803477.5 ns 795239 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32133 ns 32854 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1168342.5 ns 1222475 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 351479 ns 464063 ns 0.76
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 49250 ns 50950 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8875 ns 8250 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9271 ns 8687.5 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 8667 ns 9646 ns 0.90
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9083 ns 15771 ns 0.58
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 283467 ns 289332 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 22237807 ns 22396972 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5030812.5 ns 5647520.5 ns 0.89
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 384844 ns 389255 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9834 ns 9792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9834 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9875 ns 9875 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9834 ns 9791 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23024 ns 23549 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2093062 ns 2127803 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 223166 ns 223688 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 216402 ns 215812 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45750 ns 45583 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45583 ns 45833 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46000 ns 45834 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45750 ns 45792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 285399.5 ns 292557 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 9799339 ns 11637949 ns 0.84
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 968750 ns 1005416 ns 0.96
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 625876 ns 620161.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56250 ns 56250 ns 1
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 56458 ns 56375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 56459 ns 56458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57917 ns 57750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28644 ns 29238.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1187526 ns 1197390 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 631292 ns 658208 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 205262 ns 204172 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 459458 ns 451791.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 465375 ns 471500 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 497666.5 ns 468000 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 476896 ns 441791.5 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 243906 ns 250364.5 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 33120338 ns 32745444 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9379625 ns 10042062.5 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 852638 ns 848179.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 586000 ns 581125.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 645146 ns 649645.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 591042 ns 657583 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 660999.5 ns 614250 ns 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 206101 ns 209963 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8668014 ns 8555661.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1370250 ns 1375959 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 238977 ns 264153 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2245208 ns 2243542 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2238291.5 ns 2233479 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2233812.5 ns 2247312 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2238042 ns 2249041 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 956767.5 ns 981693 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48728968 ns 47646947 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7240916.5 ns 7438458 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1384018 ns 1260099 ns 1.10
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19458 ns 25000 ns 0.78
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19687.5 ns 19625.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22667 ns 21959 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 31250 ns 19167 ns 1.63
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 111255 ns 114255 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3455101.5 ns 3641620.5 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1420333.5 ns 1425646 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78081 ns 82081 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225333 ns 256541.5 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221583 ns 220250 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228792 ns 221687.5 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 232292 ns 221750 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 721610 ns 733642 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25857434.5 ns 27659496.5 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7765792 ns 7468958 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 566570 ns 559661.5 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 541 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 22767 ns 23294 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1215266.5 ns 1199626 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 452375 ns 380395.5 ns 1.19
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 52441 ns 50321 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9500 ns 9083.5 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10271 ns 10167 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10625 ns 10271 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9875 ns 11333 ns 0.87
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 265487 ns 269037 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 25062592 ns 25065409.5 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6018666 ns 5606334 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 420344 ns 414904 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10500 ns 8583 ns 1.22
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8062.5 ns 8458 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10292 ns 10458 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9042 ns 7625 ns 1.19
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 118808 ns 121505 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3418649 ns 3438400 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 885187.5 ns 884250 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 71665.5 ns 69061 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7542 ns 7333.5 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7875 ns 7542 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7916.5 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7417 ns 8000 ns 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 505134.5 ns 512016 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17565081 ns 18614285.5 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4294625 ns 4265271 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 329113 ns 331073.5 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1541 ns 1334 ns 1.16
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1708 ns 1625 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1895.5 ns 2000 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1375 ns 1458 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 21715 ns 20878 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1184887 ns 1144746 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 308875 ns 305042 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 187651 ns 191532 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3375 ns 3375 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3375 ns 3375 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3542 ns 3708.5 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3291 ns 3458 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 219260 ns 220885.5 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10744223.5 ns 10272002 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1724187.5 ns 1658437.5 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 593095 ns 594146 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 148145.5 ns 149042 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 106334 ns 106104 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 107187.5 ns 107459 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 233354 ns 225625 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 23884 ns 24697 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1182223 ns 1197055 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 300000 ns 300625 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 36950 ns 38181 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 144520.5 ns 144084 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 87687 ns 100709 ns 0.87
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 87792 ns 87937.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 251833 ns 263895.5 ns 0.95
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 216029 ns 219366 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10660743 ns 11143376 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2107666.5 ns 2064125 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 239532 ns 226117.5 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7167 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 5333 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5334 ns 5334 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10167 ns 10292 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32714 ns 33744 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1156018.5 ns 1208626.5 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 352875 ns 394645.5 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 53221 ns 50650 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220062.5 ns 220458.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228729.5 ns 236458 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228833 ns 229542 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 224271 ns 213437 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 260760 ns 266362.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27980428 ns 26810792 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8578229 ns 8119062.5 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 534445 ns 532916 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 14917 ns 15250 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15312.5 ns 14812.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 16708 ns 16792 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 14834 ns 15292 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 139169 ns 142309 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5708443 ns 5521569 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 797834 ns 788458 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 241323 ns 239123 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23646 ns 23209 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23812 ns 24208 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23958 ns 24104.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23709 ns 23500 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 860230 ns 874682 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38347185 ns 39249650.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5856625 ns 5835021 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 700376.5 ns 702463 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8834 ns 10062.5 ns 0.88
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9792 ns 9792 ns 1
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11083 ns 11375 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8916 ns 9250 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 122964 ns 124966.5 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3472918 ns 3573835 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 908125 ns 826250 ns 1.10
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 74011 ns 71705.5 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14500 ns 13250 ns 1.09
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14146 ns 14021 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15354 ns 14833 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14125 ns 14250 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 660283 ns 673097 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 21362028 ns 21882593 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5340833 ns 5231334 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 375084 ns 372554 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9687.5 ns 10083.5 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10520.5 ns 9333 ns 1.13
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11625 ns 10917 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9583 ns 9791 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 121557 ns 124389 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3413782 ns 3411999.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 932500 ns 932500 ns 1
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 71111 ns 71241 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12791 ns 12625 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12500 ns 12625 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13708 ns 13313 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12166 ns 12375 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 548626 ns 557333 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 20221743.5 ns 19402473 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4648729 ns 4633542 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 350268.5 ns 348154 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 29604.5 ns 29708 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 31542 ns 31750 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 30375 ns 29667 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1833 ns 1834 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/CUDA 15946 ns 16586 ns 0.96
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 74191 ns 74511 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5125 ns 5292 ns 0.97
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 4791.5 ns 4542 ns 1.05
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5291.5 ns 5375 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6375 ns 6667 ns 0.96
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 138206 ns 142234 ns 0.97
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 374353 ns 371734 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 25381 ns 26130 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1154493.5 ns 1255720 ns 0.92
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 446500 ns 468750 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48930 ns 48500 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6209 ns 6542 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6792 ns 6542 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6583 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6625 ns 6167 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 184823.5 ns 190203.5 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 23775568 ns 23758213 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5401167 ns 5392792 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 393993 ns 393904 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 1958 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2000 ns 2000 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2083 ns 2084 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 2000 ns 1958 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 25927 ns 27189 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1174327 ns 1199767 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 313812.5 ns 312750.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 208652 ns 208272 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16875 ns 15916.5 ns 1.06
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16666 ns 16291 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16291.5 ns 16979 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16687.5 ns 16312.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 271953.5 ns 276740.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 28538231.5 ns 24755518 ns 1.15
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5705375 ns 5979167 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 711016.5 ns 715538 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 148084 ns 180833 ns 0.82
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 164437 ns 151333.5 ns 1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 150583.5 ns 179000 ns 0.84
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 184958 ns 147562.5 ns 1.25
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 198930 ns 207596 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7772893 ns 7810338 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1453625 ns 1464083.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 196832 ns 195132 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1306854 ns 1308625 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1304812.5 ns 1320417 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1334500.5 ns 1326167 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1335563 ns 1318250 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 896336.5 ns 915789.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 44103385 ns 47829317 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6551250 ns 6477041 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1123231 ns 1020372 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 26000 ns 26333 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25229 ns 24750 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27479.5 ns 27709 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24791 ns 29458.5 ns 0.84
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 235714.5 ns 237299.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8360389 ns 7668370 ns 1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 618125 ns 1182167 ns 0.52
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 106221 ns 121321 ns 0.88
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 180291.5 ns 181812.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 119292 ns 118083 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 119104.5 ns 129000 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 133396 ns 118458 ns 1.13
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1061050 ns 1085787 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 47965532 ns 43559074 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6177667 ns 6188875 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 624876 ns 603482 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 291 ns 333 ns 0.87
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22572 ns 23112 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1251092 ns 1222588 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 324125 ns 395646 ns 0.82
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 48860.5 ns 48781 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6458 ns 6042 ns 1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6708.5 ns 6833 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7020.5 ns 6729.5 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6479.5 ns 6354 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 201712.5 ns 206261.5 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 24190754 ns 24411973 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5519229 ns 5650084 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 393274 ns 392024.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7083 ns 6166 ns 1.15
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6479.5 ns 5417 ns 1.20
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8375 ns 8250 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6334 ns 6416 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 144445.5 ns 148283 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5784772 ns 5523038 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 451791 ns 469750 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 237723 ns 237302 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9708.5 ns 10354 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10354.5 ns 10166 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10312.5 ns 10291 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9958 ns 10125 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 894173 ns 909984 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 41046274 ns 43302207 ns 0.95
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6098750 ns 5927833 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 677436.5 ns 689088 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 708 ns 0.88
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 708 ns 625 ns 1.13
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22187 ns 22992 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2028286 ns 2053209 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 228479.5 ns 227000 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 214902 ns 215913 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4583 ns 4584 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4625 ns 4708 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4875 ns 4625 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4584 ns 4583 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 222465 ns 228362.5 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10522858 ns 10246488 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1645875 ns 1762500 ns 0.93
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 596496 ns 596946 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8229.5 ns 8791.5 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 9083.5 ns 8021 ns 1.13
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10208.5 ns 10208.5 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7834 ns 8625 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 121070 ns 123762 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3530511 ns 3582537 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 831083 ns 795292 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 70060 ns 70171 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8500 ns 8500 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8958 ns 9084 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9333 ns 9291 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8250 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 586511 ns 599222 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 21323888.5 ns 22439265 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 4802708.5 ns 4920229 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 354733 ns 352418.5 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 125292 ns 126375 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 96708 ns 96167 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 97250 ns 96396 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 183416 ns 183208 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 45670 ns 46448 ns 0.98
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 99990.5 ns 94021 ns 1.06
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 302791 ns 302354.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 168083 ns 168625 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 166833 ns 178500 ns 0.93
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 607229.5 ns 568625 ns 1.07
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 189831.5 ns 193426.5 ns 0.98
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 489695 ns 485945.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398375 ns 398500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215333 ns 214958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 215125 ns 215459 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756459 ns 755958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43130 ns 43652 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1398407.5 ns 1354730.5 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 412042 ns 489291.5 ns 0.84
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 83571 ns 83401 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1405604.5 ns 1416708 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 863250 ns 861208 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 861479.5 ns 863229.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2358542 ns 2359083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 249090 ns 249519.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 10996775 ns 11581786 ns 0.95
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1820250 ns 1843542 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 355383 ns 354834 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 611208 ns 651104 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 648500 ns 636792 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 648812 ns 662104.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 662875 ns 581792 ns 1.14
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194388.5 ns 204117 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8240834 ns 7983269 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1397562 ns 1360250 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 254103 ns 255778 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2466021 ns 2460458 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2458875 ns 2454583 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2463604.5 ns 2468375 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2452250 ns 2463875 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 981852.5 ns 992828 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51226623.5 ns 53061666.5 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7566875 ns 7675854 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1486799.5 ns 1495551 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 32542 ns 32562.5 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 34750 ns 34584 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 32229.5 ns 32583.5 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 917 ns 833 ns 1.10
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15560 ns 15923 ns 0.98
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 78491 ns 73991 ns 1.06
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3083 ns 3145.5 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3479.5 ns 3416 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3334 ns 3458.5 ns 0.96
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3125 ns 3084 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 136477.5 ns 139769 ns 0.98
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 359243 ns 346409 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 407250 ns 407417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 402125 ns 401791 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 401833 ns 401916 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 421584 ns 421167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 43081.5 ns 43360 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1443601.5 ns 1424417 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1160541.5 ns 1149708 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 242377.5 ns 244183 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3877250 ns 3883958 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3991438 ns 3996708.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3995500 ns 3992125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3778791.5 ns 3780895.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 240481 ns 246111 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36046095 ns 36934379 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11740520.5 ns 11631750 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1247192 ns 1246158.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3916 ns 3958 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3958 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3959 ns 3917 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3916 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33151 ns 33757 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1246525 ns 1234748.5 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 178083 ns 181500.5 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 40841 ns 43060 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15459 ns 15500 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15708 ns 15583 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15792 ns 15666 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15584 ns 15541 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 250190 ns 256020 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 9448692 ns 10686428 ns 0.88
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 867250 ns 870458 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 170041 ns 178281 ns 0.95
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404041 ns 404000 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 221437.5 ns 220792 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 221041 ns 221375 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760667 ns 760833 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112818 ns 113651 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1033657 ns 1020025 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 396500 ns 412687.5 ns 0.96
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 90181 ns 91036 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1428625 ns 1438417 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 887375 ns 887125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 886333 ns 888167 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2382896 ns 2384958 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 235417.5 ns 242637 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 9699881 ns 9528776 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1899708.5 ns 1851667 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 356683 ns 357334 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 25300 ns 25949.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1213264.5 ns 1192514 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 303583 ns 296583.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 208412 ns 211622 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7208 ns 7083 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7833 ns 8000 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns 7854.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7625 ns 7333 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 208958.5 ns 216752 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 26254851.5 ns 24950983 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5627271 ns 5888042 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 691516 ns 701642.5 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 828417 ns 813667 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 465812.5 ns 465792 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 471166.5 ns 467791 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1541979 ns 1544375 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130118 ns 132054 ns 0.99
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 178896.5 ns 162431 ns 1.10
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2704041 ns 2686208 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1527521 ns 1528708 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1546750 ns 1538542 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4937042 ns 4933917 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 239281 ns 240514 ns 0.99
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 775748 ns 859970 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31418 ns 32094 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1183060 ns 1252325 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 307687.5 ns 323021 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 48455.5 ns 48681 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6125 ns 5917 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6708.5 ns 6333 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6562.5 ns 6792 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6312.5 ns 6083 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 222050 ns 223941.5 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22089679 ns 23466112 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5038937.5 ns 5053625 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 368804 ns 369274 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2384708 ns 2397083 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2406334 ns 2379291 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2401187.5 ns 2394625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2400334 ns 2379250 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 198668 ns 200806.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7953848.5 ns 8223452 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1483208 ns 1521917 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 357533 ns 359128.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4652749.5 ns 4667500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4657895.5 ns 4598667 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4677042 ns 4663834 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4656375 ns 4654084 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 891976 ns 896769 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50384184.5 ns 49138075.5 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6325542 ns 6734812.5 ns 0.94
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1259107 ns 1407615 ns 0.89
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6792 ns 7479.5 ns 0.91
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7000 ns 7125 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6917 ns 7125 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7375.5 ns 8020.5 ns 0.92
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23575 ns 23691.5 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1197552 ns 1204234 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 263166 ns 260979.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 37341 ns 33710 ns 1.11
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 52604 ns 44792 ns 1.17
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 45604 ns 33042 ns 1.38
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 49875.5 ns 33459 ns 1.49
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 66000.5 ns 71791.5 ns 0.92
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 219530 ns 217114 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 11373243 ns 10571611 ns 1.08
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2074458 ns 2004833 ns 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 240673 ns 241352 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 20750 ns 20458.5 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 24750 ns 24625 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 22083.5 ns 22625 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5958 ns 6041 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA 16981 ns 17905 ns 0.95
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 86491 ns 86031 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12041 ns 11958 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 9333 ns 9417 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 9625 ns 9500 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18083 ns 18000 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 230179 ns 230114.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 380594 ns 377559 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406208 ns 406625 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 223541 ns 223250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 223145.5 ns 223833 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762667 ns 762833 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46914 ns 46575 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1384166 ns 1399200.5 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 415875 ns 406583 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 89301 ns 89521 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1427084 ns 1445834 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 891979 ns 892854.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 891958 ns 893333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2386312.5 ns 2385770.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 287696.5 ns 281827 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 12534990 ns 11465517 ns 1.09
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2042416 ns 2034937.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 375789 ns 378964 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 433959 ns 434333 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 430208 ns 430667 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 430208 ns 430166 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 447500 ns 447292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 55750 ns 55027 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 987998 ns 1009771.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1135146 ns 1109791.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 236932 ns 236872 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3911708 ns 3915542 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4023250 ns 4022187.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4023416 ns 4023854 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3815521 ns 3802354 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 261796.5 ns 265046 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33894952 ns 31022310 ns 1.09
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10609979 ns 10484042 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1239582 ns 1238903.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8750 ns 8792 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 6875 ns 6916 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 6916 ns 6875 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12459 ns 12458 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23476 ns 23854 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2195009 ns 2189159 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 226375 ns 227167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 218422 ns 216382 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 44667 ns 44833 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 44875 ns 45000 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45416 ns 45083 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 44834 ns 44750 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 341928 ns 339090 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11449609 ns 13813520 ns 0.83
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1758917 ns 1746834 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 662766 ns 671917 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 85687.5 ns 87063 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82125 ns 92271 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 88250 ns 125250 ns 0.70
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 87750.5 ns 88396 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190673 ns 189900.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5914536.5 ns 5870133 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1998792 ns 1961729.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 208012 ns 204047 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027062.5 ns 2028417 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2018395.5 ns 2022208.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2022916.5 ns 2025000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2027750 ns 2024000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 529341.5 ns 536109.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27514758 ns 30231842.5 ns 0.91
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9494875 ns 9333542 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1104271 ns 1104742 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal force-pushed the ap/1.0 branch 2 times, most recently from 1f16397 to 1a3d7fa Compare August 21, 2024 14:59
@avik-pal avik-pal force-pushed the ap/1.0 branch 4 times, most recently from 790b513 to 38f9941 Compare August 29, 2024 19:10
@avik-pal avik-pal merged commit ef784ed into main Aug 30, 2024
74 of 75 checks passed
@avik-pal avik-pal deleted the ap/1.0 branch August 30, 2024 21:45