This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

chore: bump crate-ci/typos from 1.25.0 to 1.26.0 (#174)
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.25.0 to 1.26.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.25.0...v1.26.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
dependabot[bot] authored Oct 16, 2024
1 parent 301b59c commit 604783f
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion .github/workflows/QualityCheck.yml
@@ -16,4 +16,4 @@ jobs:
 - name: Checkout Actions Repository
   uses: actions/checkout@v4
 - name: Check spelling
-  uses: crate-ci/typos@v1.25.0
+  uses: crate-ci/typos@v1.26.0

1 comment on commit 604783f

@github-actions
Contributor

LuxLib Benchmarks
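
A minimal sketch of how one row of the table below can be read, assuming each row has the layout `<name> <current> ns <previous> ns <ratio>` and that Ratio is the current time divided by the previous time (this matches the printed values, e.g. 5375 / 5125 ≈ 1.05). The `parse_row` and `is_slower` helpers and the 1.5 threshold are hypothetical illustrations, not part of LuxLib or of the tool that posted this comment.

```python
import re
from typing import NamedTuple

# Assumed row layout: "<benchmark name> <current> ns <previous> ns <ratio>".
# The name itself contains spaces and digits, so it is matched lazily and
# the numeric tail is anchored to the end of the line.
ROW_RE = re.compile(
    r"^(?P<name>.+?) (?P<current>[\d.]+) ns (?P<previous>[\d.]+) ns (?P<ratio>[\d.]+)$"
)


class Row(NamedTuple):
    name: str
    current_ns: float
    previous_ns: float
    reported_ratio: float

    @property
    def recomputed_ratio(self) -> float:
        # Assumption: Ratio = current / previous, so values above 1 mean
        # the current commit was slower in this run.
        return self.current_ns / self.previous_ns


def parse_row(line: str) -> Row:
    """Split one benchmark row into its name, timings, and reported ratio."""
    m = ROW_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized benchmark row: {line!r}")
    return Row(m["name"], float(m["current"]), float(m["previous"]), float(m["ratio"]))


def is_slower(row: Row, threshold: float = 1.5) -> bool:
    """Hypothetical check: flag rows whose recomputed ratio reaches the threshold."""
    return row.recomputed_ratio >= threshold


if __name__ == "__main__":
    row = parse_row("batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6373937.5 ns 3229667 ns 1.97")
    print(row.name, round(row.recomputed_ratio, 2), is_slower(row))
```

Since this commit only bumps a spell-check action, large ratios in either direction are more plausibly run-to-run noise on shared runners than real changes; a check like `is_slower` would only be a first filter.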

Benchmark suite Current: 604783f Previous: 301b59c Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5375 ns 5125 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5250 ns 6937.5 ns 0.76
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7708.5 ns 7417 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5416 ns 6083 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 113361 ns 104885 ns 1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2795172 ns 2678307 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 601544 ns 401685 ns 1.50
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9729.5 ns 9917 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9938 ns 10042 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10167 ns 10750 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 11063 ns 9729 ns 1.14
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 544547 ns 495998 ns 1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 17852957 ns 18744208 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 629346 ns 680377 ns 0.92
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1500 ns 1458 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1458 ns 1541.5 ns 0.95
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1771 ns 1750 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1583 ns 3187.5 ns 0.50
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 20770 ns 20316 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1342503 ns 1305124 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 30997 ns 31190.5 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4104 ns 4334 ns 0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4500 ns 4041 ns 1.11
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4500 ns 4083 ns 1.10
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4333 ns 4354 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 134970 ns 134077 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 8677498 ns 8979794 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 138579 ns 148416.5 ns 0.93
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57666.5 ns 57500 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46875 ns 46667 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47125 ns 39917 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81458 ns 83500 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36587 ns 37564 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 582336 ns 567840.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 69420 ns 80616 ns 0.86
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2030375 ns 2038666 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2088625 ns 2081166 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2086625 ns 2084042 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1998562 ns 1991875 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 217216 ns 223666 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 8077777 ns 7677352 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 930850 ns 1187113 ns 0.78
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 175083 ns 146541 ns 1.19
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 147291 ns 148041.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 150021 ns 151625 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 151750 ns 176750 ns 0.86
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166825 ns 166355.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7358467.5 ns 7478548 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 262570 ns 190117 ns 1.38
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1115103.5 ns 1106833.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1110771 ns 1109708 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1113771 ns 1125750 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1136250 ns 1112687.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 639845.5 ns 654461 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33057102 ns 33783553 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 864075 ns 1021271 ns 0.85
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3792 ns 5333 ns 0.71
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4479 ns 5125 ns 0.87
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6583 ns 5750 ns 1.14
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6375 ns 5084 ns 1.25
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 85209.5 ns 83746 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5875726.5 ns 5563998.5 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 59531 ns 61491 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8417 ns 8792 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8750 ns 8625 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9042 ns 9250 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8958 ns 8417 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 557500.5 ns 559136 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 34838164 ns 34995936.5 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 370833 ns 392504 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17958 ns 17083 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16458 ns 18000 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21125 ns 18791.5 ns 1.12
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17292 ns 17708.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 63776.5 ns 63135.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 2927491.5 ns 3027434.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 82870 ns 74881 ns 1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212625 ns 218791 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213042 ns 212063 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212771 ns 213375 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212291 ns 218250 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 329859 ns 334874 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 12611094 ns 15538427 ns 0.81
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 405232 ns 465885 ns 0.87
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 625 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 708 ns 0.88
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 875 ns 792 ns 1.10
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 709 ns 667 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 19101 ns 19376 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1145778 ns 1181689 ns 0.97
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 26409 ns 30801 ns 0.86
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1458 ns 1375 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1334 ns 1417 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1583 ns 1500 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1375 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 117126.5 ns 115818 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8850213 ns 8578264 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 115676 ns 125221.5 ns 0.92
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7291 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6041 ns 6125 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6084 ns 5458 ns 1.11
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 10375 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23587 ns 24404 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1261233 ns 1185331.5 ns 1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 52723 ns 47150 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229167 ns 259875 ns 0.88
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230667 ns 239750 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 267875 ns 238375 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 257458 ns 212937.5 ns 1.21
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 182744 ns 194467.5 ns 0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 32590762.5 ns 30488731 ns 1.07
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 548449.5 ns 603521 ns 0.91
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 4084 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 4125 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 4084 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 22860 ns 23361 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 1933593 ns 1914869 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 39504 ns 47581 ns 0.83
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 17042 ns 16958 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16875 ns 16875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17083 ns 16667 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16875 ns 16750 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 185787.5 ns 186194.5 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 10029430 ns 9861733 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 162052 ns 172361.5 ns 0.94
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 491583 ns 490917 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 385625 ns 385541 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 386458 ns 313292 ns 1.23
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 844083 ns 846958.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113763 ns 113486.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 418213 ns 398692.5 ns 1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 388657 ns 245177.5 ns 1.59
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2155583 ns 2139937 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1863374.5 ns 1863583 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1865167 ns 1584583.5 ns 1.18
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3377520.5 ns 3114083 ns 1.08
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 229580 ns 229713.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 9922983 ns 11955773.5 ns 0.83
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 610962 ns 745073 ns 0.82
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6500 ns 7104 ns 0.91
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5500 ns 6792 ns 0.81
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7667 ns 7083 ns 1.08
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5167 ns 6229.5 ns 0.83
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 84720.5 ns 83179 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5300415 ns 6726845 ns 0.79
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 59932 ns 59261 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11229 ns 10250 ns 1.10
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11395.5 ns 11458 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12334 ns 11895.5 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10667 ns 11166.5 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 602168 ns 592614 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38613143.5 ns 37936205.5 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 383917 ns 410389 ns 0.94
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 583 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23328 ns 23257 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2178076 ns 2214765 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 41367 ns 48421 ns 0.85
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2084 ns 2167 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2166 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2208 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2084 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 228927.5 ns 230148 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 11774524 ns 11848732 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 165900 ns 178962 ns 0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9584 ns 8917 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8333 ns 8917 ns 0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9895.5 ns 10083 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8542 ns 9208 ns 0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 105241 ns 99883.5 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3103348.5 ns 3281834 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 71955 ns 73811 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17688 ns 17438 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16666.5 ns 17125 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18708 ns 19125 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17562 ns 17375 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 595171 ns 574862.5 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 16252508 ns 17368412 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 358129 ns 382279 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 459 ns 1.18
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 458 ns 583 ns 0.79
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 458 ns 500 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 34578 ns 34631 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1237584 ns 1211808 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 41387 ns 48701 ns 0.85
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9229 ns 9146 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8958.5 ns 9521 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9750 ns 10229.5 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8104 ns 8604 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 257823 ns 260130.5 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18331589 ns 19439989.5 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 349944 ns 367164 ns 0.95
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397270.5 ns 396854.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288083 ns 288229.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288666.5 ns 215042 ns 1.34
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 751792 ns 755958 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112022 ns 112250 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 349915 ns 328996 ns 1.06
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 74609 ns 75451 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1454270.5 ns 1462500 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1130500 ns 1136041 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1131583 ns 860334 ns 1.32
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2437959 ns 2439875 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 200057 ns 199853.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 7687949 ns 9985334 ns 0.77
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 302285 ns 324698 ns 0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7750 ns 7000 ns 1.11
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7083.5 ns 7687.5 ns 0.92
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8312.5 ns 8375 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6687.5 ns 7499.5 ns 0.89
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 139766 ns 138856 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5685169 ns 6055720 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 60383 ns 60111 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13479.5 ns 15874.5 ns 0.85
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 12750 ns 16271 ns 0.78
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15125 ns 15792 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14625.5 ns 13125.5 ns 1.11
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 923489 ns 911828 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 42519536.5 ns 42608795.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 407432 ns 429664 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25625 ns 24000 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23666 ns 25958 ns 0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 29417 ns 26833.5 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24041 ns 24937.5 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 186240.5 ns 189463 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7554376 ns 7536335 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 120505 ns 112782 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 152187 ns 146084 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 145250 ns 152541.5 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 146917 ns 105833 ns 1.39
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103958 ns 153500 ns 0.68
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1013659 ns 1027043 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44493070 ns 41813684 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 535240 ns 587426 ns 0.91
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74583 ns 74042 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 79584 ns 84500 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76791.5 ns 74917 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76083 ns 74333 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 190594.5 ns 195104 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7364811 ns 7388961 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 121316.5 ns 121551 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 273562.5 ns 281250 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 304084 ns 290833 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 303333 ns 244667 ns 1.24
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 307583 ns 297125 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1045024 ns 1044893 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 39473308 ns 40287331 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 624192 ns 693978 ns 0.90
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12417 ns 12583.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12896 ns 13333.5 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14000 ns 14000 ns 1
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12500 ns 13125 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 138416 ns 137568.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5479910 ns 5655781 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 226152 ns 235892 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27792 ns 27458 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26458 ns 28437 ns 0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28437.5 ns 27583 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 33937.5 ns 25396 ns 1.34
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 924126.5 ns 925629.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 42086872 ns 42183215 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 610976 ns 696807 ns 0.88
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11124.5 ns 10583.5 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 10333 ns 11729 ns 0.88
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12479.5 ns 14020.5 ns 0.89
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11125 ns 11166 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 118543.5 ns 119207 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3443799.5 ns 3459447 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 233176 ns 241797.5 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 22291.5 ns 22228.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 22417 ns 22979 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 24167 ns 24041 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 28562.5 ns 22958 ns 1.24
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 668341 ns 679984 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 21034051 ns 21093495.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 569113 ns 675492 ns 0.84
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 68709 ns 65145.5 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 62750 ns 69062 ns 0.91
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 67520.5 ns 67375 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 64417 ns 63250 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 102389 ns 102654.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3441143 ns 3365331 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 230751 ns 244962 ns 0.94
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 506375 ns 512250 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 510167 ns 511875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 475209 ns 467958.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 647896 ns 464791 ns 1.39
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 492781 ns 497974 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20664230 ns 19959026 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 593680 ns 716037 ns 0.83
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7958 ns 7458 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6750 ns 7479.5 ns 0.90
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8208 ns 8791.5 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7562.5 ns 7000 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 137965 ns 136611.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5508177.5 ns 5668588 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 62687 ns 59181 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16125 ns 16084 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16250 ns 16104 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16250 ns 15145.5 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14833 ns 15292 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 900927 ns 892529 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 39349971 ns 37494483 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 388286 ns 399300 ns 0.97
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6150354 ns 6148250 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6368167 ns 6373958.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6373937.5 ns 3229667 ns 1.97
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11915167 ns 11910625 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 345749 ns 348836 ns 0.99
batchedmm(512, Bsize=4)/forward/GPU/oneAPI 49052559 ns 48313142 ns 1.02
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 388426 ns 303493 ns 1.28
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19083437.5 ns 19111312.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19960479.5 ns 19956500 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19966834 ns 11118833 ns 1.80
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 37142104 ns 36495125 ns 1.02
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1072087 ns 1010983 ns 1.06
batchedmm(512, Bsize=4)/zygote/GPU/oneAPI 78467188 ns 77165819 ns 1.02
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1035750.5 ns 1185177 ns 0.87
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 958 ns 1000 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1000 ns 1042 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1042 ns 1042 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 958 ns 959 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23415 ns 23306 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2079171 ns 2102392 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 200906 ns 210582 ns 0.95
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3917 ns 3958 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4000 ns 3959 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4041 ns 4041 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5458 ns 3958 ns 1.38
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 270573.5 ns 274898 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10484095 ns 10835037 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 486775 ns 633051.5 ns 0.77
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8687 ns 7667 ns 1.13
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7459 ns 8271 ns 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9334 ns 10167 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7834 ns 7999.5 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 116220 ns 116562 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3435001.5 ns 3281805 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 71133 ns 68781 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 12125 ns 11833 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11958 ns 12271 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 13000 ns 13500 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11750 ns 12209 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 609643.5 ns 610392 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 21784602 ns 20835527 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 341729 ns 356904 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 291 ns 291 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22413 ns 22489 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2035110 ns 2031329 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 44053 ns 49170 ns 0.90
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3000 ns 3042 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2917 ns 2958 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3208 ns 3209 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2916 ns 2834 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 194923.5 ns 196092.5 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9225861.5 ns 9721843.5 ns 0.95
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 154488.5 ns 166151.5 ns 0.93
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11625 ns 11542 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10500 ns 12584 ns 0.83
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12875 ns 13333.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11875 ns 10959 ns 1.08
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 115370 ns 115616.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3433218 ns 3435294.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 231793 ns 238972 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 22667 ns 22250 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 22104.5 ns 22583.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23625 ns 22875 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 26729 ns 23104.5 ns 1.16
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 555861 ns 561561 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20482208 ns 19972002 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 545740 ns 652206 ns 0.84
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4334 ns 4167 ns 1.04
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4333 ns 4375 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4208 ns 4458 ns 0.94
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4250 ns 4375 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 23923 ns 24400 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2205811 ns 2160362 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 44864 ns 49090 ns 0.91
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16500 ns 16167 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16333 ns 16625 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16166 ns 16291 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16292 ns 16584 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 319806 ns 320232 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 10190777 ns 12103289.5 ns 0.84
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 186077 ns 205902 ns 0.90
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2125 ns 2084 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2084 ns 2209 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2209 ns 2167 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2000 ns 2125 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 35327 ns 35395 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1213779 ns 1121264.5 ns 1.08
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 199242 ns 218222 ns 0.91
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 17104 ns 17896 ns 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20167 ns 17916 ns 1.13
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 19000 ns 19125 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23083.5 ns 18146 ns 1.27
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 284984 ns 286121 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 18211018 ns 20551833.5 ns 0.89
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 583431 ns 685457 ns 0.85
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59458 ns 60208.5 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 65666 ns 65458 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 66125 ns 60938 ns 1.09
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 52833 ns 53875 ns 0.98
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66304 ns 66633 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/oneAPI 87707222.5 ns 86298273 ns 1.02
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 110241 ns 102431 ns 1.08
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 153041 ns 197791.5 ns 0.77
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 155229 ns 162042 ns 0.96
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 130209 ns 137250 ns 0.95
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 286334 ns 295208 ns 0.97
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 210129.5 ns 211289 ns 0.99
batchedmm(16, Bsize=512)/zygote/GPU/oneAPI 149924497 ns 152039178 ns 0.99
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 511145 ns 510905 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 106521 ns 123834 ns 0.86
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 78958 ns 123125 ns 0.64
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 84042 ns 84312.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 115521 ns 90875 ns 1.27
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 191513.5 ns 193182.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5334020 ns 5322780 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 267630 ns 192502 ns 1.39
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1894896 ns 1921875 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1902375 ns 1909416 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1878334 ns 1888250 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1895250 ns 1881750 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 507442 ns 510619 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28152566.5 ns 26283882 ns 1.07
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 825763 ns 911709 ns 0.91
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21516 ns 21603 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2100524 ns 2089663 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 35507 ns 42021 ns 0.84
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1833 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 245735 ns 246530.5 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9780504 ns 9718939 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 164548 ns 183711 ns 0.90
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10916 ns 8083 ns 1.35
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8291 ns 9791 ns 0.85
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11146 ns 12125 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9500 ns 8458 ns 1.12
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 114788 ns 115667.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3351587 ns 3479265.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 232004 ns 238712 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8916 ns 10334 ns 0.86
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8854.5 ns 10375 ns 0.85
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10917 ns 10709 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9583 ns 10750 ns 0.89
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 491693 ns 493762.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19969043 ns 19419012 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 536332 ns 631376 ns 0.85
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57958 ns 58459 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46625 ns 46541 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46750 ns 39791 ns 1.17
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83166 ns 82958 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38476.5 ns 39195 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1460287 ns 1326636 ns 1.10
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 71814 ns 77861 ns 0.92
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1905145.5 ns 1927333.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1949542 ns 1977312 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1958500 ns 1955167 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1874958 ns 1892417 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 212675 ns 217765.5 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33332615 ns 33483865 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 968925.5 ns 1004015.5 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 267500 ns 267875 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 271479.5 ns 277417 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 271209 ns 270958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 268209 ns 278250 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 194219.5 ns 198525 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7638787 ns 7684906 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 271267 ns 283563 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 585333.5 ns 614937.5 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 600292 ns 658104 ns 0.91
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 671042 ns 590146 ns 1.14
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 845604.5 ns 646750.5 ns 1.31
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 991966 ns 1004951 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42952243 ns 44721716 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 831153 ns 899859 ns 0.92
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2211666 ns 2206250 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2203958 ns 2176625 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2229083 ns 2107416 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2173792 ns 2210708 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 161646 ns 158799 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8668502.5 ns 8305150 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 470965 ns 412934 ns 1.14
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5493104.5 ns 5495166.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5515875 ns 5498084 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5526542 ns 5497292 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6852458 ns 5479145.5 ns 1.25
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 959137 ns 942447 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49532486 ns 52379643 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1437405 ns 1717957 ns 0.84
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 478292 ns 476375 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 345625 ns 344833 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 346750 ns 255667 ns 1.36
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 908542 ns 909083 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46909 ns 46257.5 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 871386 ns 876632 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 393175 ns 245143 ns 1.60
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2137500 ns 2148125 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1869334 ns 1855417 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1859271 ns 1588042 ns 1.17
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3380209 ns 3122292 ns 1.08
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 264095.5 ns 253305 ns 1.04
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 13390420 ns 13286897 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 632907.5 ns 772413 ns 0.82
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57458 ns 57958.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46166 ns 45791.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46250 ns 39417 ns 1.17
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 78667 ns 82625 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28560 ns 28551 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1394875.5 ns 1363872 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 73147 ns 74231 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2029292 ns 2040292 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2078187.5 ns 2064375 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2063250 ns 2084167 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1963958 ns 1983271 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 230846.5 ns 225739 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36347331 ns 35716396.5 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 980522 ns 1031871 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58083.5 ns 58333 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46584 ns 46834 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46917 ns 39667 ns 1.18
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79958 ns 83000 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 48944 ns 48471 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 829446 ns 789293.5 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 71428.5 ns 71026 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1871729 ns 1926917 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1973604 ns 1963709 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1944167 ns 1974354 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1876792 ns 1891625 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 238010 ns 232200 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 18705710.5 ns 17717639 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 881607.5 ns 916564 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 34878 ns 33909 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1190778.5 ns 1226571 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 47028 ns 45910 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6270.5 ns 5916 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6187.5 ns 7187.5 ns 0.86
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7375 ns 7459 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6125 ns 6333 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 211705.5 ns 201066 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20119098 ns 20694042 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 332741 ns 365424 ns 0.91
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32902 ns 32008 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1224139 ns 1150720 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 36327 ns 37940 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2667 ns 2709 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2667 ns 3041 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 4292 ns 3708 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 3167 ns 3500 ns 0.90
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 187662.5 ns 181870 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 5673429 ns 7654347.5 ns 0.74
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 136635 ns 149631 ns 0.91
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 467208 ns 491875 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 469417 ns 465938 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 466875 ns 469979 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 464979.5 ns 495375 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 137312 ns 134587.5 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5812904.5 ns 6261994 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 361475 ns 348083 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4027749.5 ns 4056250 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4071500 ns 4071312.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4067417 ns 4083458.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5516750 ns 4067500 ns 1.36
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 690445 ns 675142 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32063716 ns 34719295 ns 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1091915 ns 1296728 ns 0.84
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49879250 ns 49815354 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35487583 ns 35531875 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35512833.5 ns 25976083 ns 1.37
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 96974083 ns 96976979 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1622377 ns 1620332 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/oneAPI 55868634.5 ns 55439103 ns 1.01
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1579230 ns 1059456 ns 1.49
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154423062.5 ns 154432166.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112364750 ns 112364500.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112377416 ns 88728958 ns 1.27
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 299989812 ns 298587354.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6468945 ns 6497993.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/oneAPI 126761495 ns 126106582 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 7230228 ns 5589506 ns 1.29
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 19104.5 ns 18292 ns 1.04
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 18375 ns 17542 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 17375.5 ns 13625 ns 1.28
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15083 ns 16583.5 ns 0.91
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 19621 ns 19675 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1223248 ns 1142269.5 ns 1.07
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 28854 ns 27480 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 11062.5 ns 11000 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 8833 ns 9020.5 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9291 ns 7792 ns 1.19
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17667 ns 17375 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 252067.5 ns 242665 ns 1.04
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 9844493 ns 10148653 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 138484 ns 144671.5 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7937.5 ns 7958.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8125 ns 9125 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10375 ns 10375 ns 1
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8708 ns 7833.5 ns 1.11
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 120230.5 ns 117743.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3557828.5 ns 3571636.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 235119 ns 238312 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9708 ns 9083 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9084 ns 10188 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9792 ns 11500 ns 0.85
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10667 ns 9500 ns 1.12
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 599437 ns 580494.5 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22720103 ns 24076504 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 557070 ns 649931.5 ns 0.86
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9291.5 ns 9416 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8812.5 ns 9709 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9917 ns 10458 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8958.5 ns 9396 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 118821 ns 114984 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3465548.5 ns 3341616 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 71593 ns 71321 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13687.5 ns 13916.5 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13604.5 ns 13541.5 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 14395.5 ns 17208.5 ns 0.84
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 14750 ns 13187.5 ns 1.12
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 570663 ns 552056 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20121784.5 ns 20781499.5 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 323504 ns 344233 ns 0.94
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 35088 ns 33628 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1218149.5 ns 1186325 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 203871 ns 207932 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7562.5 ns 7437 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7667 ns 8584 ns 0.89
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 9666 ns 0.81
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8520.5 ns 7354.5 ns 1.16
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 227876 ns 221757 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22566032 ns 22841477 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 569945 ns 657467 ns 0.87
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 16458 ns 16583 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 17041 ns 16958 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 16209 ns 12354 ns 1.31
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10979 ns 11625 ns 0.94
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 20941 ns 19779 ns 1.06
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1150830 ns 1178666.5 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 182992 ns 191642 ns 0.95
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 35666 ns 35375 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 35167 ns 35479 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 36000 ns 35479.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 57833 ns 35584 ns 1.63
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 265749 ns 258411 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 12188303 ns 11074698.5 ns 1.10
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 534293 ns 591756 ns 0.90
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 447500 ns 449333 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 488042 ns 450125 ns 1.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 455709 ns 463875 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 496916 ns 486917 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195513 ns 194667 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5997948.5 ns 5885088 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 328714 ns 347133 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4024209 ns 4054500 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4055021 ns 4060604.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4053917 ns 4063834 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5501562.5 ns 4052291.5 ns 1.36
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 521631.5 ns 510233 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27256015 ns 28172431.5 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1059038 ns 1353408.5 ns 0.78
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 836727208 ns 780318375 ns 1.07
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 553913292 ns 543371375 ns 1.02
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 540736625 ns 415007687 ns 1.30
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1517196875 ns 1572225062.5 ns 0.96
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22767789 ns 22558969 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/oneAPI 174930068 ns 174041531 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 10331681 ns 14555295 ns 0.71
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3773348667 ns 2500858833 ns 1.51
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1782084291 ns 1786181583 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1780399750 ns 1510021583 ns 1.18
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4786718666 ns 6317458166 ns 0.76
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118657187 ns 119503116 ns 0.99
batchedmm(512, Bsize=512)/zygote/GPU/oneAPI 1332561794 ns 931368955.5 ns 1.43
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 67063298 ns 87832876 ns 0.76
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76542 ns 76375 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76584 ns 77083 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79583 ns 83334 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76708.5 ns 75354 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 195943.5 ns 194473.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 5455658.5 ns 8155928 ns 0.67
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 123300.5 ns 106291 ns 1.16
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 191292 ns 277375 ns 0.69
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 252042 ns 193666.5 ns 1.30
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 199562.5 ns 291542 ns 0.68
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225542 ns 203875 ns 1.11
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1004442 ns 999103 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43458500 ns 42482446 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 590764 ns 628231.5 ns 0.94
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199694520.5 ns 199366166.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 138856500 ns 139444084 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139241166 ns 103950000 ns 1.34
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 393790959 ns 388306958 ns 1.01
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5842492 ns 5837076.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/oneAPI 78913006.5 ns 78178829 ns 1.01
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 4746717.5 ns 3620336 ns 1.31
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 617676375.5 ns 617703104.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 439446917 ns 438890042 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 439765166.5 ns 352507250 ns 1.25
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1174222000 ns 1183186458 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26723523 ns 26786910.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/oneAPI 276392509 ns 274964991 ns 1.01
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 15854720 ns 21952578.5 ns 0.72
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7292 ns 7250 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 6125 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5959 ns 5417 ns 1.10
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9834 ns 9917 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26896.5 ns 26517 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1173091 ns 1160586 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 55173 ns 46431 ns 1.19
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213041.5 ns 224854 ns 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227729 ns 230541 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220416.5 ns 229812.5 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206125 ns 207958 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 219868 ns 215879.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20153337 ns 20490896 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 541982 ns 528825 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8521 ns 6458 ns 1.32
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7458 ns 9000 ns 0.83
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11167 ns 9750 ns 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9250 ns 8770.5 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 115361 ns 109989.5 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3392154.5 ns 3318372 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 74069 ns 72691 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7562.5 ns 7666.5 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7958 ns 8417 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8167 ns 11750 ns 0.70
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7395.5 ns 7562.5 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 495697 ns 485874.5 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20965461 ns 19877956 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 309298 ns 315043 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 417 ns 458 ns 0.91
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 459 ns 750 ns 0.61
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 500 ns 750 ns 0.67
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 375 ns 459 ns 0.82
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 26124 ns 25151 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1243719 ns 1214235 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 45334 ns 48561 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9584 ns 8833 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9062.5 ns 9542 ns 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9792 ns 11834 ns 0.83
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9542 ns 8250 ns 1.16
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 247606 ns 245667 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 24899790.5 ns 23350383.5 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 382304 ns 388103 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 112312.5 ns 111708 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 103229 ns 101708 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 104104.5 ns 87542 ns 1.19
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 155083 ns 154542 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 23501 ns 22556 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 811475 ns 822944.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 192539 ns 200302 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 536562 ns 576604.5 ns 0.93
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 554250 ns 577208 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 535291.5 ns 579583 ns 0.92
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 910854 ns 535334 ns 1.70
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 221242 ns 215893 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11751092 ns 11598893 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 560216.5 ns 606916 ns 0.92
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5416.5 ns 5500 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 6208.5 ns 6187.5 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 6021 ns 7583 ns 0.79
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 4000 ns 5646 ns 0.71
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17520 ns 16999 ns 1.03
batchedmm(16, Bsize=32)/forward/GPU/oneAPI 72849606 ns 71875004 ns 1.01
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 73648 ns 71250 ns 1.03
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 11562.5 ns 12166.5 ns 0.95
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 11062 ns 10833.5 ns 1.02
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11000 ns 11104 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16666 ns 16667 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 207455.5 ns 203355.5 ns 1.02
batchedmm(16, Bsize=32)/zygote/GPU/oneAPI 97442684 ns 97881235 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 330387 ns 362713 ns 0.91
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39667 ns 40375 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51291 ns 51334 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52958.5 ns 51083 ns 1.04
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13625 ns 13625 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/CUDA 20356 ns 21217 ns 0.96
batchedmm(16, Bsize=128)/forward/GPU/oneAPI 76663129 ns 78292175 ns 0.98
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 98364 ns 81245.5 ns 1.21
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36375.5 ns 37437.5 ns 0.97
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 31417 ns 31833.5 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 31229.5 ns 30145.5 ns 1.04
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57000 ns 57333 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 184178 ns 180954 ns 1.02
batchedmm(16, Bsize=128)/zygote/GPU/oneAPI 111708023 ns 111821475 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 355254 ns 393694 ns 0.90
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1750 ns 1667 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 2042 ns 1834 ns 1.11
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2208 ns 2583 ns 0.85
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1583 ns 1.18
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 19575 ns 19103 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1219758.5 ns 1181507 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 29099.5 ns 29580 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2208 ns 2291 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2167 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2375 ns 2541 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2208 ns 2125 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 198996.5 ns 192587.5 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 8766738.5 ns 9137253 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 128571 ns 137661 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4583 ns 5041 ns 0.91
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4417 ns 4792 ns 0.92
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6729 ns 6708 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3958 ns 5292 ns 0.75
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 143699.5 ns 139532 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5704411.5 ns 5873388 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 61955.5 ns 61421 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8334 ns 8250 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8083.5 ns 8583 ns 0.94
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8709 ns 9333 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8583 ns 8062.5 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 836045.5 ns 812160 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 39725172 ns 39105619 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 364891 ns 390114 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 54833 ns 55000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 55833 ns 55875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 55583 ns 54333 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 56000 ns 56208 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36570 ns 36258 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1345223 ns 1233762 ns 1.09
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 202568 ns 217247.5 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 476729 ns 523187.5 ns 0.91
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 494500 ns 495646 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 494208 ns 509125 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 641625 ns 508354 ns 1.26
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 259886 ns 258312 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 28017517.5 ns 27334844 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 705894 ns 802628 ns 0.88
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3310333 ns 3307500 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2334062.5 ns 2332208.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2333375 ns 1767750 ns 1.32
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6300479 ns 6289687.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 204581.5 ns 205336 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/oneAPI 77398976 ns 78138642 ns 0.99
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 373097 ns 213372 ns 1.75
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11459729 ns 11443687 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8305729.5 ns 8355854.5 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8342854 ns 6598583.5 ns 1.26
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21088292 ns 21066479 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 744676 ns 735491 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/oneAPI 121497637 ns 121355919 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1994797.5 ns 1063901 ns 1.87
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4833 ns 7208 ns 0.67
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4646 ns 6604 ns 0.70
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7520.5 ns 7708 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4917 ns 4708 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 133339 ns 130238.5 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5450569.5 ns 5600093 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 61520 ns 55701 ns 1.10
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7083 ns 7604 ns 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7291.5 ns 7562.5 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 8083 ns 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7416.5 ns 7292 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 725863 ns 714522 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 33872141 ns 35658157 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 353680 ns 368784 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 100459 ns 98292 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 123042 ns 103667 ns 1.19
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 102417 ns 127291 ns 0.80
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 121458.5 ns 122417 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 151940.5 ns 149309 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5695179 ns 5831672 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 233346 ns 183632 ns 1.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2033271 ns 2028041 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2026417 ns 2022292 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1997458.5 ns 2031625 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2041833 ns 2019021 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 678763 ns 669751 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31810809 ns 34116389.5 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 931831 ns 1113696 ns 0.84
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 32666 ns 32999.5 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 36562.5 ns 36208 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 36167 ns 33125 ns 1.09
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 667 ns 542 ns 1.23
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15627 ns 15437 ns 1.01
batchedmm(2, Bsize=4)/forward/GPU/oneAPI 72187220 ns 72358742 ns 1.00
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 70121 ns 84900 ns 0.83
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2604.5 ns 2667 ns 0.98
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2958 ns 3000 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2937.5 ns 3208 ns 0.92
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2167 ns 2250 ns 0.96
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 139744 ns 136315 ns 1.03
batchedmm(2, Bsize=4)/zygote/GPU/oneAPI 92749943 ns 92893398 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 289641 ns 350423 ns 0.83
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7208 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 6083 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5916 ns 5416 ns 1.09
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 10167 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 35855 ns 35436 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1252207 ns 1228537 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 53911 ns 49691 ns 1.08
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212958.5 ns 232749.5 ns 0.91
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 222708 ns 221125 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219917 ns 227541.5 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206209 ns 205750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 243430 ns 240533 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27468024.5 ns 26122810 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 513269 ns 509435 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3917 ns 0.96
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3958 ns 0.95
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3791 ns 3917 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21959 ns 21412 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2194149 ns 2114597 ns 1.04
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 35557 ns 42980 ns 0.83
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14500 ns 14542 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14500 ns 14917 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14500 ns 14792 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14459 ns 14917 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 302419 ns 297410.5 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11036089 ns 10838818 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 179841 ns 196172 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 128041 ns 97937 ns 1.31
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144417 ns 102750 ns 1.41
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 106917 ns 130333 ns 0.82
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 151959 ns 127709 ns 1.19
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 140874 ns 132466 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5963081 ns 5909094 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 236762 ns 182122 ns 1.30
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1924583 ns 1924333 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1920500 ns 1920667 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1914229.5 ns 1921792 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1928875 ns 1912771 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 673452 ns 659652 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 29935915 ns 31062786 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 899671 ns 1217372 ns 0.74
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17333 ns 17625 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17354.5 ns 18666.5 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21208 ns 21834 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17375 ns 17125 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 108833.5 ns 103789.5 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3415955 ns 3441121 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 91100 ns 75841 ns 1.20
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216917 ns 229375 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 252646 ns 217917 ns 1.16
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222166 ns 226458.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 229125 ns 215521 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 508535.5 ns 496186 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19323488.5 ns 18765642 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 419764 ns 473665 ns 0.89
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 24271 ns 24313 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 30791.5 ns 29875 ns 1.03
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 29437.5 ns 27375 ns 1.08
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1584 ns 1250 ns 1.27
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16398 ns 15897 ns 1.03
batchedmm(16, Bsize=4)/forward/GPU/oneAPI 72518390 ns 71655631.5 ns 1.01
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 76093 ns 87071 ns 0.87
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4500 ns 5375.5 ns 0.84
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 4916 ns 5083.5 ns 0.97
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5125 ns 5459 ns 0.94
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4625 ns 4834 ns 0.96
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 204364 ns 200684.5 ns 1.02
batchedmm(16, Bsize=4)/zygote/GPU/oneAPI 94073985 ns 92849344 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 331675 ns 389014 ns 0.85
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222666 ns 222083 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 220666.5 ns 223166 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 225667 ns 224916.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 220583 ns 227000 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 222506.5 ns 219523 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7881934.5 ns 7712821.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 267871 ns 274002.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 495084 ns 495292 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 511812.5 ns 549771 ns 0.93
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 500854 ns 507520.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 675750 ns 497583 ns 1.36
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1053634 ns 1034369 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42862742 ns 42519004 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 780999 ns 850318.5 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20375 ns 19708 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20000 ns 21375 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23875 ns 22292 ns 1.07
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18792 ns 24792 ns 0.76
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 114286 ns 111603.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3510843 ns 3581394.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 89858 ns 77006 ns 1.17
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212375 ns 218812 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213041 ns 213041.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214458 ns 221958.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212541 ns 250667 ns 0.85
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 727333.5 ns 710892 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 24570511 ns 24867084.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 469036 ns 532655 ns 0.88
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6666 ns 5959 ns 1.12
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6604.5 ns 6917 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8750.5 ns 8708 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6208 ns 5917 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 137142 ns 131648 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5605207 ns 5786966.5 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 60974 ns 65661 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9791 ns 10584 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10084 ns 10729.5 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10750 ns 11541 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10750 ns 10541 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 794651.5 ns 772200 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 37034174 ns 37330612 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 370101.5 ns 385494 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4666 ns 4833 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4708 ns 6354.5 ns 0.74
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7437.5 ns 6604.5 ns 1.13
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4917 ns 5041 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 138544.5 ns 133064 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5520602 ns 5822443 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 59692 ns 57140 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7458 ns 7209 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7166 ns 7666 ns 0.93
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7791 ns 8042 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7708 ns 7500 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 755761 ns 738153 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 37179182 ns 40138762.5 ns 0.93
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 376523 ns 395034 ns 0.95
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14498417 ns 14423167 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10124125 ns 10121834 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10094833 ns 7695041.5 ns 1.31
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27748583.5 ns 27731208 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 532665 ns 530060 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/oneAPI 94795139 ns 94502665 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 866850 ns 400144 ns 2.17
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46333437 ns 46295271.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33447541.5 ns 33585729.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33510458 ns 26523271 ns 1.26
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85445667 ns 85105834 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2636151 ns 2636621 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/oneAPI 192783631 ns 190779173 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 5189385.5 ns 3293333 ns 1.58
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66458 ns 67125 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 65687.5 ns 68791 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 70500 ns 69875 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 66500 ns 67541 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 118172.5 ns 116341 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3662360 ns 3481863 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 237313 ns 238303 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 467958 ns 467979.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 480333.5 ns 468833 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 474916.5 ns 479729 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 686583.5 ns 467333.5 ns 1.47
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 715446 ns 704065 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26609747 ns 26310960 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 655875 ns 795648 ns 0.82
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32877 ns 32111 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1227269 ns 1221683 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 47579 ns 47180 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8750 ns 8375 ns 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9208 ns 9417 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9104.5 ns 9584 ns 0.95
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9750 ns 8416 ns 1.16
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 280778.5 ns 277435.5 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21881943 ns 20099617 ns 1.09
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 355484 ns 375813.5 ns 0.95
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9500 ns 9459 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9500 ns 9625 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9500 ns 9708 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9500 ns 9625 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23273 ns 22950 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 1862112.5 ns 2089156.5 ns 0.89
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 200655 ns 212492 ns 0.94
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 50209 ns 50167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 50250 ns 50292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 50500 ns 50541 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 72375 ns 50375 ns 1.44
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 278469.5 ns 272026 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 13204061 ns 11125411 ns 1.19
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 491037 ns 611216 ns 0.80
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 54917 ns 55250 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 55667 ns 55917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 55584 ns 54375 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 56000 ns 56041 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28169 ns 27749 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1174691 ns 1229944.5 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 203240 ns 214587 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 518854 ns 485479 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 500625 ns 496084 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 497750 ns 537000.5 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 643417 ns 461291.5 ns 1.39
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 238777 ns 237315 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31628121.5 ns 32908722.5 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 758938 ns 839118 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 655042 ns 651166.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 613083 ns 645917 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 652541 ns 662000 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 678416.5 ns 641417 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192069 ns 190601 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8140636 ns 8668801 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 269704 ns 229822 ns 1.17
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2167104.5 ns 2241917 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2233125 ns 2232875 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2241292 ns 2250458.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2230208.5 ns 2234417 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 929752.5 ns 914905 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 55073105 ns 49141404 ns 1.12
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1217770.5 ns 1359913 ns 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19500 ns 21375 ns 0.91
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19208.5 ns 20938 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23542 ns 22583 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20000 ns 19167 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 111306 ns 109650 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3589059.5 ns 3622083 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 91551 ns 75660 ns 1.21
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220459 ns 218833.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226458 ns 221084 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223104.5 ns 235688 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219708 ns 221125.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 714110 ns 709252 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26626181 ns 25088612.5 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 487481 ns 553695 ns 0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23491 ns 23372.5 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1232519 ns 1180770.5 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 43771 ns 49900 ns 0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9417 ns 9874.5 ns 0.95
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9291.5 ns 9708 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9708 ns 10229.5 ns 0.95
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9646 ns 9334 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 261581 ns 259739 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 23734390 ns 25804898 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 381618 ns 401304 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8917 ns 9541 ns 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7583 ns 9187.5 ns 0.83
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11854.5 ns 10833 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9042 ns 8875 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 115935.5 ns 113457.5 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3441325 ns 3378008 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 70456.5 ns 69850 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 7625 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7542 ns 7937.5 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8000 ns 8375 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7292 ns 7541.5 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 484010 ns 474853 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17813154.5 ns 17576598 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 302215 ns 322123 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1417 ns 1500 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1667 ns 1666.5 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1959 ns 2187.5 ns 0.90
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1500 ns 1542 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 20030 ns 19317 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1146657 ns 1172938.5 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 184144 ns 192092 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3708 ns 3542 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3625 ns 3625 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3833 ns 3833 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4917 ns 3500 ns 1.40
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 213101.5 ns 209093.5 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10511562.5 ns 10006581.5 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 524324.5 ns 581056 ns 0.90
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 148729 ns 148416 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 128917 ns 127541.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 129917 ns 107500 ns 1.21
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 235541 ns 225042 ns 1.05
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 22778 ns 22459 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1179919.5 ns 1201113 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 46868 ns 37415.5 ns 1.25
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 143645.5 ns 143666.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 130875 ns 110916 ns 1.18
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 138417 ns 100875 ns 1.37
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 290021 ns 250834 ns 1.16
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 211960 ns 206476 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10741797 ns 10778609 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 223578 ns 220822 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7334 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5958 ns 6000 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958.5 ns 5375 ns 1.11
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10041 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33236 ns 33038 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1203805 ns 1161067.5 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 57207 ns 48271 ns 1.19
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221249.5 ns 220021 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 238542 ns 227708 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 264500 ns 243333 ns 1.09
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213250 ns 212750 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 259447 ns 256906 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27707385 ns 27263274.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 530542 ns 522055 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 13209 ns 12333 ns 1.07
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12166 ns 13020.5 ns 0.93
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13584 ns 14333.5 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12667 ns 12917 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 135078 ns 131126.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5685986 ns 5521631 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 227730.5 ns 235402 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23917 ns 24520.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24083.5 ns 24187 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24750 ns 25354.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 30146 ns 23625 ns 1.28
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 833527 ns 816371.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 39963084.5 ns 39369345 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 615374.5 ns 684572 ns 0.90
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9271 ns 9208 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9541 ns 10042 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10375 ns 11167 ns 0.93
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9250 ns 9625 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 119628 ns 116949.5 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3356719.5 ns 3478536 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 74940 ns 70201 ns 1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14041 ns 14250 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13958 ns 13771 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14750 ns 15416 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13459 ns 13958 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 638262 ns 627909.5 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 22466836 ns 21438120 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 344824 ns 377354 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9666.5 ns 8958 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9208 ns 10437.5 ns 0.88
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10959 ns 11750 ns 0.93
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9083.5 ns 9166 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 118521 ns 115964 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3571671.5 ns 3401614.5 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 79399 ns 72371 ns 1.10
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13416 ns 13208 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12416 ns 12854 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13479.5 ns 13958 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12708 ns 12416 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 530027 ns 516349.5 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19360325 ns 19477250 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 317163 ns 339683.5 ns 0.93
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 30896 ns 30291.5 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 33813 ns 34041.5 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 32249.5 ns 30042 ns 1.07
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1875 ns 2083 ns 0.90
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16425 ns 16187 ns 1.01
batchedmm(2, Bsize=128)/forward/GPU/oneAPI 76985679 ns 75928615 ns 1.01
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 76663 ns 78561 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5417 ns 5291.5 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5000 ns 5499.5 ns 0.91
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5479.5 ns 5375 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6270.5 ns 6375 ns 0.98
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 138278 ns 135964 ns 1.02
batchedmm(2, Bsize=128)/zygote/GPU/oneAPI 109824422.5 ns 110752109 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 340566 ns 382864 ns 0.89
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 291 ns 1.14
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 25574 ns 24855 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1142450 ns 1239551 ns 0.92
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 45666 ns 48910 ns 0.93
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6458 ns 6459 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6375 ns 6604 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6791.5 ns 7208.5 ns 0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6458.5 ns 6125 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 185923.5 ns 180794 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 22900684.5 ns 24106911.5 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 365402.5 ns 390139 ns 0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2084 ns 2000 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2084 ns 2125 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2083 ns 2125 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 2000 ns 2042 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 26453 ns 25818 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1207656 ns 1193002 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 203645.5 ns 219547 ns 0.93
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18041 ns 17500.5 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17166.5 ns 17833.5 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17750 ns 18437.5 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 23458.5 ns 17500 ns 1.34
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 268326 ns 264425 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24994377.5 ns 24505308 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 600702.5 ns 705652 ns 0.85
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 147875 ns 178208 ns 0.83
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 155437.5 ns 165145.5 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 155125 ns 179042 ns 0.87
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 151708 ns 151292 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190890.5 ns 187400 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7974634 ns 7801096 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 271146.5 ns 191502 ns 1.42
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1321937.5 ns 1317104 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1330625 ns 1320125 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1308375 ns 1331937 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1285166 ns 1318125.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 867140 ns 859849 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 45331705.5 ns 43918638 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1006962 ns 1005140 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25500 ns 24084 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23542 ns 24708 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28708.5 ns 28063 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24416.5 ns 26291.5 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 226899 ns 226248 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7680667 ns 8086333 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 128029 ns 115141 ns 1.11
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 125062.5 ns 160416.5 ns 0.78
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 165729.5 ns 132958 ns 1.25
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 125854.5 ns 127937.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 180062 ns 124437.5 ns 1.45
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 998018.5 ns 978646 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44411227 ns 45755327 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 568743 ns 587856 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23453 ns 22971 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1190116 ns 1181802 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 44533 ns 48630 ns 0.92
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6895.5 ns 6333 ns 1.09
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6729.5 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6958 ns 7291 ns 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6520.5 ns 6541.5 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 201834 ns 197400 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 23542895 ns 24703832 ns 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 372536 ns 392804 ns 0.95
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5645.5 ns 5584 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5375 ns 6958 ns 0.77
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7979 ns 8021 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5166 ns 5875 ns 0.88
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 139838.5 ns 135487.5 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5619575.5 ns 5687030 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 229750 ns 235072 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9958 ns 10083.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10042 ns 10458.5 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10417 ns 10500 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10854.5 ns 9833.5 ns 1.10
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 866511 ns 841087.5 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 43130156 ns 41023608 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 603858 ns 675251.5 ns 0.89
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 708 ns 708 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 708 ns 667 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 750 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 708 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22827 ns 22206 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2079377 ns 2616381.5 ns 0.79
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 202368 ns 209832 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4834 ns 4875 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4833 ns 4917 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5125 ns 5208 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6291 ns 4875 ns 1.29
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 222098 ns 215367.5 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9952955 ns 11863776 ns 0.84
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 471721 ns 591926 ns 0.80
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8750 ns 7729.5 ns 1.13
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7834 ns 7958 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9375 ns 9833 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7646 ns 7750.5 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 117939.5 ns 115622 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3568146 ns 3536818 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 74409 ns 71851 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8792 ns 8542 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8583 ns 8687.5 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8875 ns 9520.5 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8083 ns 8520.5 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 568724.5 ns 552863 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 20842961 ns 20606879 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 335106 ns 343673.5 ns 0.98
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126042 ns 127854 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129208 ns 128834 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 129542 ns 96354 ns 1.34
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 180792 ns 183167 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46423 ns 45982 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/oneAPI 72616088 ns 72286847 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 101850 ns 95811 ns 1.06
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 315875 ns 330459 ns 0.96
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 334166.5 ns 332334 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 323291.5 ns 197417 ns 1.64
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 609395.5 ns 571042 ns 1.07
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 187684 ns 183822.5 ns 1.02
batchedmm(128, Bsize=4)/zygote/GPU/oneAPI 93899553 ns 93731117 ns 1.00
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 405833.5 ns 473290 ns 0.86
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397500 ns 397125 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287979.5 ns 288229 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288375 ns 215375 ns 1.34
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756000 ns 756375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43964 ns 43348 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1424885 ns 1384285 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 79439 ns 79971 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1461000 ns 1459375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1133834 ns 1132396 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1129645.5 ns 862770.5 ns 1.31
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2449292 ns 2442500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 254140 ns 239777 ns 1.06
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 11042616 ns 13231788 ns 0.83
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 254646 ns 351138.5 ns 0.73
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 626500 ns 647458 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 657208.5 ns 649666 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 649750.5 ns 655021 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 642417 ns 641583.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 185720.5 ns 178508 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8332264.5 ns 8381344 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 264649 ns 240322 ns 1.10
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2452625 ns 2454875 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2465208.5 ns 2450333 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2459375 ns 2461646 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2376375 ns 2458334 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 949649 ns 938639.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 53455476.5 ns 52014786 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1323598 ns 1448719 ns 0.91
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 32458 ns 33146 ns 0.98
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 36521 ns 35708 ns 1.02
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34833 ns 32000 ns 1.09
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 959 ns 875 ns 1.10
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15902 ns 15683 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/oneAPI 73782106 ns 73122838 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 74499.5 ns 71645.5 ns 1.04
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3125 ns 3187.5 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3250 ns 3458 ns 0.94
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3375 ns 3541 ns 0.95
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3062.5 ns 3083 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 137187.5 ns 134592 ns 1.02
batchedmm(2, Bsize=32)/zygote/GPU/oneAPI 98822060.5 ns 97284653 ns 1.02
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 314258 ns 337323.5 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 436500 ns 439375 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 438625 ns 440583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 438791 ns 431375 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 445917 ns 450375 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 42826 ns 42224 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1503651 ns 1392161 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 374379.5 ns 237893 ns 1.57
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4140000 ns 4138958 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4271375 ns 4247291.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4270687.5 ns 4262792 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5468750 ns 4028416.5 ns 1.36
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 236201.5 ns 233746 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36248116 ns 36534446 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1135862 ns 1234322 ns 0.92
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3791 ns 3917 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3958 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3709 ns 3916 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34158 ns 34090 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1274307 ns 1239089 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 41117 ns 40520 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15375 ns 15291 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15334 ns 15958 ns 0.96
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15500 ns 15750 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15250 ns 15667 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 255579 ns 251120.5 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 8309435 ns 8891050 ns 0.93
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 158606 ns 171192 ns 0.93
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404792 ns 404125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295917 ns 295250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 295958 ns 220625 ns 1.34
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 759750 ns 760666 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113245 ns 113428 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1043498 ns 1051037 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 91962 ns 89110.5 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1482854 ns 1479125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1158625 ns 1156270.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1150334 ns 886792 ns 1.30
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2466708 ns 2464333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 236768.5 ns 227639.5 ns 1.04
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 9725420.5 ns 12228324 ns 0.80
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 298578 ns 352474 ns 0.85
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 25569 ns 24868 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1198679 ns 1263047 ns 0.95
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 202679 ns 214292 ns 0.95
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8083 ns 7541 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7792 ns 7917 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8375 ns 8250 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8437.5 ns 7667 ns 1.10
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 207068.5 ns 202491 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25228707 ns 25565257 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 593474 ns 687187 ns 0.86
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 829375 ns 830417 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 617667 ns 617334 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 618667 ns 467125 ns 1.32
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1544417 ns 1539875 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130866 ns 130469 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/oneAPI 74874331.5 ns 74138060 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 211214 ns 167662 ns 1.26
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2686104.5 ns 2680895.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1994542 ns 1979750 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1998375 ns 1532167 ns 1.30
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4960479 ns 4935708 ns 1.01
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 234509 ns 233179 ns 1.01
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI 102181218 ns 101283369 ns 1.01
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 831293.5 ns 855698 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32562 ns 31956 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1276503 ns 1162026.5 ns 1.10
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 48691 ns 49090 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6333 ns 6187 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6375 ns 6770.5 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6667 ns 7042 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6104.5 ns 6375 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 227701 ns 217529.5 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21756022 ns 22613407 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 346728 ns 355723.5 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1760625 ns 1750042 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1749875 ns 1774250 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1744292 ns 1759417 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1755166 ns 1775625 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189332 ns 177451 ns 1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7765672 ns 8059544 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 413433 ns 355403 ns 1.16
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4360416 ns 4352125 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4366917 ns 4360770.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4349104 ns 4377083.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5705104 ns 4357583 ns 1.31
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 849205 ns 843625 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48802559 ns 47645217 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1205562.5 ns 1390698 ns 0.87
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 9604 ns 14562.5 ns 0.66
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6916 ns 9667 ns 0.72
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 8208 ns 8292 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6854 ns 6666.5 ns 1.03
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 22924.5 ns 22207 ns 1.03
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1184238.5 ns 1231018 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 46437 ns 37720 ns 1.23
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 50604.5 ns 64458.5 ns 0.79
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 52166 ns 70792 ns 0.74
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 45458.5 ns 45708 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 33312.5 ns 49521 ns 0.67
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 211538 ns 204835 ns 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10576796.5 ns 10627124.5 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 226508 ns 233202 ns 0.97
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 21646 ns 21292 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 26083.5 ns 24770.5 ns 1.05
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 24958.5 ns 22334 ns 1.12
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5291.5 ns 7416 ns 0.71
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18121 ns 17630 ns 1.03
batchedmm(2, Bsize=512)/forward/GPU/oneAPI 88732630 ns 87889435 ns 1.01
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 73668 ns 90301 ns 0.82
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12125 ns 12187 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10667 ns 10625 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10833 ns 9750 ns 1.11
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18042 ns 18041.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 221707 ns 216733.5 ns 1.02
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI 148404121 ns 151365483 ns 0.98
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 322703 ns 384574 ns 0.84
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 405917 ns 405417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 296791.5 ns 297333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 297167 ns 223417 ns 1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 756709 ns 762625 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46696 ns 46368 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1393570.5 ns 1390104 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 90770 ns 90091 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1487375 ns 1487792 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1163500 ns 1159187.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1157209 ns 892375 ns 1.30
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2472417 ns 2470895.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 283340.5 ns 267416 ns 1.06
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 11947586 ns 13880635 ns 0.86
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 269032 ns 378473.5 ns 0.71
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 436458 ns 436458 ns 1
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 443270.5 ns 438916 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 440750 ns 431708 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 449000 ns 450167 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 53940 ns 53539 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1027722 ns 1016245 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 323133 ns 235682 ns 1.37
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4138541 ns 4143041 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4268354.5 ns 4257999.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4258750 ns 4266292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5475229.5 ns 4032437.5 ns 1.36
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 255597 ns 253837 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31502698.5 ns 31122046.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1132896.5 ns 1206682 ns 0.94
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9333 ns 9208 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 8000 ns 8167 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 8000 ns 7208 ns 1.11
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 13250 ns 13416 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23885 ns 23370 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 1973050 ns 2190811 ns 0.90
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 202528 ns 212852 ns 0.95
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 49625 ns 49416 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 49667 ns 50083 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 49583 ns 49541 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 71667 ns 49667 ns 1.44
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 336641 ns 331181 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 13058534 ns 12227793 ns 1.07
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 508895.5 ns 657676 ns 0.77
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 108270.5 ns 123458 ns 0.88
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 86167 ns 85271 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86500 ns 127292 ns 0.68
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 146083 ns 108541.5 ns 1.35
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192063 ns 191180.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5750624 ns 6110005 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 267851 ns 200667 ns 1.33
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2018917 ns 2014999.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2016937.5 ns 1877583 ns 1.07
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2011375 ns 2016083 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2024000.5 ns 2015916 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 511598 ns 510301 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30563079 ns 27606531 ns 1.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 860237 ns 943229 ns 0.91

This comment was automatically generated by workflow using github-action-benchmark.
