docs: fix links to CI
avik-pal committed Nov 3, 2024
1 parent 007b559 commit 409eda2
Showing 1 changed file with 8 additions and 8 deletions.
16 changes: 8 additions & 8 deletions README.md
@@ -88,14 +88,14 @@ Pkg.add("Lux")

<!-- CI -->

-[gh-actions-lux]: https://github.com/LuxDL/Lux.jl/workflows/CI/badge.svg
-[gh-actions-lux-prerelease]: https://github.com/LuxDL/Lux.jl/workflows/CIPreRelease/badge.svg
-[gh-actions-luxlib]: https://github.com/LuxDL/Lux.jl/workflows/CI_LuxLib/badge.svg
-[gh-actions-luxcore]: https://github.com/LuxDL/Lux.jl/workflows/CI_LuxCore/badge.svg
-[gh-actions-mldatadevices]: https://github.com/LuxDL/Lux.jl/workflows/CI_MLDataDevices/badge.svg
-[gh-actions-weightinitializers]: https://github.com/LuxDL/Lux.jl/workflows/CI_WeightInitializers/badge.svg
-[gh-actions-luxtestutils]: https://github.com/LuxDL/Lux.jl/workflows/CI_LuxTestUtils/badge.svg
-[gh-actions-luxcuda]: https://github.com/LuxDL/Lux.jl/workflows/CI_LuxCUDA/badge.svg
+[gh-actions-lux]: https://github.com/LuxDL/Lux.jl/workflows/CI%20(Lux)/badge.svg
+[gh-actions-lux-prerelease]: https://github.com/LuxDL/Lux.jl/workflows/CIPreRelease%20(Lux)/badge.svg
+[gh-actions-luxlib]: https://github.com/LuxDL/Lux.jl/workflows/CI%20(LuxLib)/badge.svg
+[gh-actions-luxcore]: https://github.com/LuxDL/Lux.jl/workflows/CI%20(LuxCore)/badge.svg
+[gh-actions-mldatadevices]: https://github.com/LuxDL/Lux.jl/workflows/CI%20(MLDataDevices)/badge.svg
+[gh-actions-weightinitializers]: https://github.com/LuxDL/Lux.jl/workflows/CI%20(WeightInitializers)/badge.svg
+[gh-actions-luxtestutils]: https://github.com/LuxDL/Lux.jl/workflows/CI%20(LuxTestUtils)/badge.svg
+[gh-actions-luxcuda]: https://github.com/LuxDL/Lux.jl/workflows/CI%20(LuxCUDA)/badge.svg
[gh-actions-lux-url]: https://github.com/LuxDL/Lux.jl/actions/workflows/CI.yml
[gh-actions-lux-prerelease-url]: https://github.com/LuxDL/Lux.jl/actions/workflows/CIPreRelease.yml
[gh-actions-luxlib-url]: https://github.com/LuxDL/Lux.jl/actions/workflows/CI_LuxLib.yml
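
The new badge links percent-encode the space in each workflow's path segment (`CI%20(Lux)`) and leave the parentheses as they are. A minimal Python sketch of how such a URL could be built is shown here. It assumes the workflow display names are the decoded forms, such as `CI (Lux)`, which is inferred from the URLs and not confirmed by the diff. `badge_url` is a hypothetical helper, not part of Lux.jl or of any GitHub API.

```python
from urllib.parse import quote


def badge_url(repo: str, workflow_name: str) -> str:
    """Build a name-based workflow badge URL in the style used by the README.

    `repo` is "owner/name". `workflow_name` is assumed to be the workflow's
    display name (for example "CI (Lux)"); that is an inference from the diff.
    """
    # Percent-encode the space but keep parentheses literal, matching the
    # "CI%20(Lux)" form that appears in the updated links.
    encoded = quote(workflow_name, safe="()")
    return f"https://github.com/{repo}/workflows/{encoded}/badge.svg"


if __name__ == "__main__":
    for name in ("CI (Lux)", "CI (LuxLib)", "CIPreRelease (Lux)"):
        print(badge_url("LuxDL/Lux.jl", name))
```

Passing `safe="()"` matters here: by default `quote` would also encode the parentheses as `%28` and `%29`, which is a valid URL but would not match the form used in the README.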

1 comment on commit 409eda2

@github-actions

Lux Benchmarks

Benchmark suite Current: 409eda2 Previous: 699c8d8 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4334 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4125 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5417 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4167 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 59978 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10333 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10167 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10500 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10167 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 416390 ns
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1166.5 ns
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3042 ns
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1208 ns
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1000 ns
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18063 ns
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4084 ns
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3958 ns
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4250 ns
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4125 ns
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 109325.5 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56041 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46084 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46375 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81834 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36229 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2056625 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2082416.5 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2056666.5 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1995458 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 192802 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 172458 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144854.5 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 148125 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 146125 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166789 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1157666 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1110395.5 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1128416.5 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1120208 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 516061 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3583 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3583.5 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4229.5 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3292 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 69748 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8792 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9125 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9000 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9209 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 470533 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15083 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 14875 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16583 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14917 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 53475 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222375 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213084 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213250 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213520.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 267675 ns
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 584 ns
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17384 ns
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1500 ns
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1500 ns
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1750 ns
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1583 ns
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 103376 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7041 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5625 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5709 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9916 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23093 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 227583.5 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230417 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228000 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215542 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 166208.5 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3916 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3834 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3834 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23533 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16708 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16750 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16791 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16625 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 160718 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 577333 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 573417 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 579000 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 574042 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113474 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1432312.5 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1426250 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1425917 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1418000 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 211622 ns
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1046541 ns 1068292 ns 0.98
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 965500 ns 983291 ns 0.98
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1347458 ns 1327542 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1290542 ns 1373792 ns 0.94
lenet(28, 28, 1, 64)/forward/GPU/CUDA 267857 ns 281111 ns 0.95
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5895833.5 ns 6002271 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4588042 ns 4660958.5 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4928187 ns 5006354 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5737167 ns 5624708 ns 1.02
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1066176 ns 1151478.5 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23460 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2084 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2292 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 169490.5 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5458 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4000 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5687.5 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6250 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 64594 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11083 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11333 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12041 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11083.5 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 444224 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6708 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6416 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7875 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6500 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 51136 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17583 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16958 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18145.5 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16916 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 297812 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 31896 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8916 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8667 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9250 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8645.5 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 155805 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64937.5 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 62625 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64500 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64667 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 110478.5 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 294791 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 279125 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 275479.5 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 280854.5 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 185224.5 ns
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3152041.5 ns 3387083 ns 0.93
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3026187 ns 3112854 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3022520.5 ns 2905708 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3964167 ns 3940000 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 573818.5 ns 570283 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7551166.5 ns 7636021 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7449979 ns 7442000 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7447000 ns 7380521 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8208396 ns 8212750 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1327975 ns 1364212 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18867458 ns 13685833.5 ns 1.38
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19142541 ns 19094334 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19088834 ns 19126041 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15711167 ns 15649500.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 24315583.5 ns 23644021 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33983500 ns 34568146 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37046583.5 ns 41693959 ns 0.89
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34841833 ns 34878583 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2130242 ns 1840287 ns 1.16
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 192387270.5 ns 188357375 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 163943875 ns 233488333 ns 0.70
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152577625 ns 202742250 ns 0.75
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 437847333 ns 429823895.5 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 14119852 ns 13939550 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 294725229.5 ns 291377187.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 338344395.5 ns 249397167 ns 1.36
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 300590083.5 ns 300701042 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 396800708.5 ns 446062833 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23687.5 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23083 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24791 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23708 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 95862 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103250 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 103458 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 103667 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 102750 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 494978 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7083 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5750 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6875 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7000 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 67128 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15375 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15395.5 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16000 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14791.5 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 467877 ns
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3009166.5 ns 3055292 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2067250 ns 2092833 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2279667 ns 2283687.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4832667 ns 4895416.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 581800.5 ns 585359 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23921708.5 ns 23561833 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18037292 ns 18085229 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16963187.5 ns 18562458 ns 0.91
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 34623770.5 ns 35017833 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3105602 ns 3105298.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33780291 ns 33378229 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27715666.5 ns 27662145.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27451041 ns 27887458 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41640208 ns 41809854.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 80479 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 72416 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 78354 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74645.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 100885 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 311542 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 224520.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 209667 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 257021 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 539235 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12500 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11708 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12542 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12833.5 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 70648 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26667 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26958.5 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27333.5 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26625 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 470896 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12791 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12333 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13500 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12875 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 52214 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25959 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25750 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26500 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26500 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 300818.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 180750 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 179583 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183146 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179250 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56380 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 593542 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 582459 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 585042 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 594562 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 284588 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6770.5 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5958 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7084 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7125 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70103 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14709 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14500 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15291.5 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13958 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 460969.5 ns
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1217750 ns
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1209125 ns
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1249750 ns
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1326625 ns
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302841 ns
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4351270.5 ns
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4353042 ns
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4630333 ns
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4466479 ns
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1039570 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1792 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23644 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4875 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4875 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5042 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 189061.5 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6021 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5708 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7042 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7416 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 54998.5 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11437.5 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11084 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11666 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 12333 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 332242 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22998 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2667 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2750 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2750 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2709 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 158762.5 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 13687.5 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11208 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13958 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 14125 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 57325 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24625 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24250 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25500 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24875 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 295945 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4167 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4166 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4125 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24912 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16084 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16209 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16333.5 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16208 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 199034.5 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5708 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5584 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5708 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5708 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33099 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 21166 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20458 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21333.5 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20875 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 174613 ns
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 383042 ns
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 373541 ns
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 485896 ns
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 532854.5 ns
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66578.5 ns
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 938166 ns
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 847083 ns
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1235042 ns
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1418833 ns
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 191164 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 81020.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80354.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82250 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 132458 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192525 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1945166 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1909584 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1920333 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1914354.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 402795 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21790 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1791 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1916 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 172681 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8000 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6833 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8334 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7999.5 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 62227.5 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9375 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8875 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9625 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9250 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 315550.5 ns
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 159022167 ns 121038500 ns 1.31
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174256125 ns 174268209 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147914021 ns 155647417 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 102407958 ns 103289458 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5468366 ns 5459016 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 678096083 ns 592681937.5 ns 1.14
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 555598625 ns 540116125 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 453528479 ns 460022146 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 754205958.5 ns 623412250 ns 1.21
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34940005 ns 38146652 ns 0.92
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 703546875 ns 751859749.5 ns 0.94
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 666832020.5 ns 667614542 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 585927312.5 ns 606980437.5 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 742692916 ns 744028250 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57542 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47583 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47291 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82208 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37135 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1947333 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1971042 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1976458 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1893520.5 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 171380.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 272291 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 265834 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 289417 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267167 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 135867.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 671917 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 596708 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 696292 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 692687.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 737698 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2231188 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2215042 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2207229 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2243770.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133226 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5572500 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5486875 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5511083 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5495666.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 759202.5 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 652833.5 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 657229 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 639500 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 639791 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46976 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1799583 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1724792 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1722792 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2103895.5 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 221178.5 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56541 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46833 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46041 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83792 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28073 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2058250 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2078709 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2093000 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1996646 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 187152 ns
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13406125 ns 13355875 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12455458 ns 12430958.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12584792 ns 12600937.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 14882959 ns 15122729 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 517201.5 ns 518849 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47687000 ns 47134500 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41754625 ns 41671875 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40922625 ns 41125499.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58112708 ns 58336333 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3212087 ns 3218047 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74213479 ns 74376750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 68010000 ns 68965000 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90988625 ns 91496292 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76809750 ns 98399104 ns 0.78
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56917 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47042 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47041 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83375 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 46301 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1939854 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1973333 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1974729.5 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1884375 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 189579 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 291 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 333 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 250 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 31617 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6229.5 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6167 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6458 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6167 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 171396 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 250 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31328 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2583 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2625 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2792 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2625 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 161410 ns
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 324182500 ns 286107083.5 ns 1.13
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339536042 ns 339607208 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314625854 ns 321183396 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 273060250 ns 268796333 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7093070 ns 7107764 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1051455583 ns 971792250 ns 1.08
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 941830875 ns 922480542 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 858538271 ns 835684104 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1153691292 ns 1117474583 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34020243.5 ns 33742759 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1359481562.5 ns 1448964667 ns 0.94
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1360673729 ns 1371326875 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1640965792 ns 1656412041 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1309802292 ns 1663889000 ns 0.79
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1414416.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1409541 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1408500 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1453875 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127358 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5056229 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5013583 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4954291 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5017021 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 601067 ns
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 170719208 ns 177405459 ns 0.96
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 132607979.5 ns 132546709 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 124493437.5 ns 130053917 ns 0.96
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 162230500 ns 165568083 ns 0.98
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4886055.5 ns 4878153.5 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 854987208 ns 643663333 ns 1.33
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 644456708 ns 496969000 ns 1.30
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 532057834 ns 558568375 ns 0.95
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 687805708 ns 654929750 ns 1.05
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16138006 ns 18110009 ns 0.89
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 9114041.5 ns
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8770313 ns
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7860292 ns
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10147292 ns
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1612586 ns
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 37546375 ns
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 36886146 ns
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33451021 ns
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38875771 ns
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6459090.5 ns
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47458.5 ns
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 49333 ns
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 49583 ns
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47250 ns
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18585 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50584 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50416 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50708.5 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50500 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 216293 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7979.5 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6791 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8875 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8583 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 106035 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10333 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9958 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10500 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10167 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 612658 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8750 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6438 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8667 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5875 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 119844.5 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13375 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13000 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13416 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12791 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 517417.5 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1042 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 958 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1042 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 31817 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8041 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7750 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8333 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8292 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 203048 ns
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23145.5 ns
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 24541 ns
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 24167 ns
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23334 ns
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18371 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52542 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52416 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52500 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52334 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 295739.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1440625 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1400291 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1400875 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1406313 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194620 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5047479.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5003458.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4836292 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4996708 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 628014 ns
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3062438 ns 3064313 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2084417 ns 2106875 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2227208.5 ns 2301542 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4812250 ns 4944708.5 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 579246 ns 586671 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24741125 ns 25694166 ns 0.96
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18811521 ns 20092625.5 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18691437 ns 19545895.5 ns 0.96
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36587416 ns 36568812 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3196070 ns 3200820 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34435312 ns 35138250 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28306583.5 ns 28420084 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28069750 ns 30280062.5 ns 0.93
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41958375 ns 42544854.5 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 145325041 ns
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 141848041.5 ns
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 123758375 ns
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173196604 ns
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22560824 ns
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 942531917 ns
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 871530625 ns
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1498315250 ns
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 674150833 ns
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118289465 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76208 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75041 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 77875 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 75417 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 273038.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 299708 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 284646 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 191687.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 202979.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1439967 ns
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 36345458 ns
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35416645.5 ns
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32239562.5 ns
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40930312.5 ns
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5849412 ns
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 151966416 ns
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 152232437.5 ns
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 136165208.5 ns
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 287396625 ns
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34914778 ns
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 158627833 ns 120765334 ns 1.31
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174511667 ns 174275666 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148215771.5 ns 156098417 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 108212479 ns 103997770.5 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5459784 ns 5461795.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 524328229.5 ns 471697125 ns 1.11
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 467038291 ns 468205208 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 441190000 ns 455789333 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 741818542 ns 728998166 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32279915 ns 35173763 ns 0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 692549750 ns 640412562.5 ns 1.08
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 656203708.5 ns 655505917 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 573625208 ns 590476187.5 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 853537834 ns 732032000 ns 1.17
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1226937.5 ns 1249541 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 992979 ns 949958.5 ns 1.05
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 904625 ns 764125 ns 1.18
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2085917 ns 2000458 ns 1.04
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 566912.5 ns 568299.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2909667 ns 2960792 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2628208 ns 2611021 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2006333.5 ns 2513020.5 ns 0.80
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3693750.5 ns 3690271 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1796011.5 ns 1319857 ns 1.36
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6757875 ns 6641791 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6503250 ns 6504791 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6239125 ns 6489375 ns 0.96
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4454771 ns 4443166 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6167 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6208 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10250 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24809.5 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213666 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220313 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220125 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 209542 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 276995.5 ns
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 315354292 ns 309099062.5 ns 1.02
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 221860750 ns 232469666.5 ns 0.95
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 197740833.5 ns 216377833 ns 0.91
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 312004542 ns 308762583 ns 1.01
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7676221 ns 7672114 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1085627020.5 ns 1103432604 ns 0.98
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 891084375.5 ns 1001458208 ns 0.89
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 865730125 ns 901919771 ns 0.96
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1163266979.5 ns 1293921625 ns 0.90
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26544800.5 ns 27115979 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6083 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5583 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7375 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5270.5 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 178949 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7708 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7292 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6792 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 667282.5 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 542 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 459 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 459 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23245 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9583.5 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9167 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9458.5 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8792 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 227149 ns
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 352521.5 ns
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 352709 ns
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352958.5 ns
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 352708 ns
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21007 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 828104 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 820292 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 773500 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 828312 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 289596 ns
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 312083.5 ns
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 340166.5 ns
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 445354 ns
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 333520.5 ns
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17918 ns
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 691583 ns
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 732334 ns
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1026459 ns
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 691042 ns
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 273557 ns
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 332396 ns
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 348875 ns
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 409541 ns
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 375250 ns
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22378 ns
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 755875 ns
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 743000 ns
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1068417 ns
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 822124.5 ns
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 239682 ns
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3625 ns
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3417 ns
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3583 ns
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3583 ns
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17823 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4208 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4167 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4375 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4292 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 271995 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4792 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3834 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5250 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3625 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 214003.5 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8354.5 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8334 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8667 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8417 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1200425 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204209 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210000 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211875 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199417 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34086 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 608520.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 620750 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 620416 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 628625 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 347622 ns
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 980000 ns
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 929916.5 ns
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 954250 ns
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1278542 ns
batchedmm(128, Bsize=128)/forward/GPU/CUDA 206777 ns
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4651729 ns
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4500083 ns
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4296645.5 ns
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6216979.5 ns
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 942518 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3916 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3375 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4667 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3354.5 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 231395.5 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7375 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7292 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7667 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7000 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1002762 ns
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1644583 ns 1618667 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1174458 ns 1189854.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1323125 ns 1358375 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2461333.5 ns 2360458 ns 1.04
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213304.5 ns 211422.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12444729.5 ns 12284958.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9564709 ns 9550979.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9234833 ns 9390791 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18020417 ns 18060041.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1940786 ns 1906624.5 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17431792 ns 17280916 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14392958.5 ns 14329167 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14240000 ns 14463083 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21049562.5 ns 21088375 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 90625 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 88041 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 92333 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 136917 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125618 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2061125 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2018458 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1720042 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2024104 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1024038 ns
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 331312 ns
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 343500 ns
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 395083 ns
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 310458.5 ns
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15733 ns
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 699959 ns
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 722062.5 ns
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1018209 ns
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 646375 ns
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 189475.5 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5958 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5875 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33239 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221625 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219959 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219750 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218375 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 314279 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22722 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14167 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14334 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14291 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14375 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 475447 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 95166.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 91833 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 96125 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 139167 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125450 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1948250 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1921104.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1669729.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1920708.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 954893.5 ns
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 854375 ns 861145.5 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 817542 ns 826334 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1213833.5 ns 1164604.5 ns 1.04
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 958895.5 ns 959395.5 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 276078 ns 263975.5 ns 1.05
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2843334 ns 2730708 ns 1.04
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2456145.5 ns 2455708.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3332000 ns 3317604.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3419792 ns 3286521.5 ns 1.04
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1629171 ns 1038213 ns 1.57
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15333 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 14709 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17041 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14333 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 142609.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 262125 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215416.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215250 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221958 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 641081.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 221583.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 218625 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222833 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 221750 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 271537.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 497750 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 494833 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 497084 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 509000 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1365399 ns
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 315729 ns
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 333917 ns
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 375125 ns
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 322083 ns
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16846 ns
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 710041 ns
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 725063 ns
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1022417 ns
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 663021 ns
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 196884 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17625 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16708 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18792 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17625 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144721 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220104.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212792 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212750 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217250 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 955774 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6042 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4250 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6958 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6541 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 245177 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10583.5 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10250 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10708 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10084 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1099715 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4542 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3208 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4834 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 2875 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 250616.5 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7125 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7375 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1110249 ns
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 24293729.5 ns 23602937.5 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34647499.5 ns 34462041.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 38065167 ns 41206708 ns 0.92
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34799687.5 ns 34998812.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1834951 ns 1861561 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 187799375 ns 184955020.5 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159175458 ns 159249771 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146555271 ns 150499917 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 415008291 ns 390550250 ns 1.06
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16504056.5 ns 16472871 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 437855250 ns 286689500 ns 1.53
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 254443000 ns 244388646 ns 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 231693624.5 ns 296120917 ns 0.78
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 485497958 ns 440533417 ns 1.10
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 184229.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 181916 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184084 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 182167 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 230730 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 637084 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 586270.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 586583 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 631542 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1097701 ns
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3894562.5 ns
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3827292 ns
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3469958 ns
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5353020.5 ns
batchedmm(128, Bsize=512)/forward/GPU/CUDA 535365 ns
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 18146250 ns
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17166041.5 ns
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16601417 ns
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 22202083 ns
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2616593 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 458 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 500 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32123 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9458 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8667 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9167 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9208 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 267754 ns
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 580762562.5 ns 624998521 ns 0.93
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 427173312.5 ns 477642917 ns 0.89
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 376948624.5 ns 411867812.5 ns 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 671986666.5 ns 656030104 ns 1.02
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12479261 ns 12477905 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2061821458.5 ns 1873735437.5 ns 1.10
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1626836125 ns 1636021583 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1500724875 ns 1558895000 ns 0.96
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2217147562.5 ns 2103890062.5 ns 1.05
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 48947892 ns 49609571 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1651250 ns 1650167 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1196959 ns 1195708 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1346187.5 ns 1388458 ns 0.97
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2356042 ns 2498125 ns 0.94
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 218070 ns 218867 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12822417 ns 12700771 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9953541.5 ns 9962124.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9605000 ns 9800459 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18408062.5 ns 18403354 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2047696.5 ns 1957280 ns 1.05
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17771104.5 ns 17702708 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14762729 ns 14737000 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14473917 ns 14865041 ns 0.97
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21336042 ns 21477333.5 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26250 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26209 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26583 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26209 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24922 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66792 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67000 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66791 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66916 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 410676.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203542 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210583 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210500 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199958 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26405 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 602333 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 621292 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 621250 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630584 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 355627 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 657646 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 638729 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 544125 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 677396 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132242 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2305542 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2254292 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1426250 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2248542 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1182706 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17937.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17042 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19500 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16895.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144900 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220000 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218416.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219458 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 261708 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1051792 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 459 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 459 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 458 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23475 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9520.5 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9541 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10166 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9375 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 261505 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6542 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5292 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6625 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7416 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 235631 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7000 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7291 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7250 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7208 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 803793 ns
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2334 ns
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2041 ns
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2292 ns
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2333 ns
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18245.5 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6750 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6459 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6667 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6625 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 333087.5 ns
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 748458 ns
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746645.5 ns
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 746833 ns
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 749417 ns
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21817 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 789125.5 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 772625 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 775145.5 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 787875 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 298327 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7291 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5959 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5750 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10792 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32858 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221541 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226958 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 226625 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220292 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 360131.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10250 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9917 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12459 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10583.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 243730.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24834 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24833.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24750 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24666 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1133764 ns
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 107061375 ns 106272125 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 116928479.5 ns 117220895.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 121136000 ns 123891541 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117635875 ns 117462292 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2659433 ns 2638590.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 396814083.5 ns 390984854 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 366591458 ns 370181584 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 425794499.5 ns 344393625 ns 1.24
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 482285959 ns 481330584 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15258375 ns 15192721.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 769963270.5 ns 619409458 ns 1.24
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 576371708 ns 668415479 ns 0.86
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 745582312 ns 816519375 ns 0.91
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 765495854.5 ns 916595917 ns 0.84
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7333 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6334 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7750 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8333 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 237972 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14125 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13209 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 13417 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13459 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1080162 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7667 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5583 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8167 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8291 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 233794.5 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12542 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11875 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12645.5 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11875 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 787815 ns
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 332667 ns
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 344396 ns
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 395770.5 ns
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 312500 ns
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16497 ns
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 706958.5 ns
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 725208 ns
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1019750 ns
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 658292 ns
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 198046.5 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 22951 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6542 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6208 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6792 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6208 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 237567.5 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5709 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5667 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5875 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5667 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24038 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21958 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 20875 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21625 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21125 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 260574.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 146812.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 143875 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145917 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 178146 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166659.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1355917 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1329374.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 861416.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1325916 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1338261 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23084 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 21458 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24042 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23958 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 350919.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 179500 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 120541 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 118167 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 151208 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1454020.5 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22580 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6291 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6334 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6791 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6208 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 253799.5 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5042 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4250 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5833.5 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4666 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 254794.5 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10042 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10042 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10417 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10125 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1352736 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1583 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1584 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1542 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23495 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5708 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5667 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5750 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5625 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 273637.5 ns
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6842458 ns 6779291.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6343020.5 ns 6365500 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6507417 ns 6531583 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7623042 ns 7635875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213659 ns 210025 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24131500 ns 24055375 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21298104 ns 21237625 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21004749.5 ns 21535792 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29792896 ns 29721771 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2117701 ns 1973993 ns 1.07
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37668083 ns 37426416 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 34323688 ns 34385895.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45641000 ns 45888792 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 38230313 ns 49367041.5 ns 0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6459 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5250 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7500 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7458 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 235380.5 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8541 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7792 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8292 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9208 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1057995 ns
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1525083 ns 1528208 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1258604.5 ns 1277937.5 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1613917 ns 1635937.5 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2159167 ns 2136917 ns 1.01
lenet(28, 28, 1, 128)/forward/GPU/CUDA 273469.5 ns 277390.5 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7971979 ns 7872250 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6561833.5 ns 6588000 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7004875 ns 7229396.5 ns 0.97
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10476458 ns 10478041 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1860749 ns 1130644 ns 1.65
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 326083.5 ns
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 347292 ns
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 379020.5 ns
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 343562.5 ns
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46613.5 ns
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 745458 ns
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 781417 ns
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1067437.5 ns
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 751125 ns
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 306721.5 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396333 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287916 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288062.5 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 751542 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43483 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 646375 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 531834 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 530042 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973417 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 188389 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 653542 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 639041.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 545542 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 655584 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131455.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2529917 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2399708 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2436833 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2460520.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1513461 ns
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 323146 ns
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 343771 ns
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 394750 ns
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 310562 ns
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15996 ns
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 699000 ns
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 717792 ns
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1016334 ns
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 649937 ns
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 196510 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1458958 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1506167 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1503458 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1442834 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39862 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5157334 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5010437.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4993104 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4988542 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 197580.5 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 32748 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14833 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15125 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15292 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15041 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 374855 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71625 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71333 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71333 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71333 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113422 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 326208 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 318250 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 319375 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 317917 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 192316 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1000 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 959 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23450 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8042 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7895.5 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8333 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7792 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 258455 ns
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 465250 ns
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 472750 ns
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 547875 ns
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 554667 ns
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130091 ns
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1420208 ns
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1378895.5 ns
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1600250 ns
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1587791 ns
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 274988 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 334 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 292 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31336 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 5959 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6354.5 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6166 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 261129.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1730708 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1721229.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1723750 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1730229 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168441.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4400167 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4366354 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3903958 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4358458 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1240708 ns
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6792 ns
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6584 ns
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6833 ns
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 14542 ns
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20531 ns
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 32708 ns
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 67708 ns
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 32833 ns
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 51667 ns
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 291979.5 ns
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 336292 ns
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 347187.5 ns
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 415021 ns
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 324666.5 ns
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18102.5 ns
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 718416.5 ns
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 727250 ns
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1030292 ns
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 672709 ns
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 346719.5 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75667 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75208 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75375 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75000 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46739 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 333209 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 331291 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 332729.5 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 324292 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 208913 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1483875 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1531875 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1529458 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1467834 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51266 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5149875 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5290166.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5287000 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4982583 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 202737.5 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28291 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28167 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28291 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28167 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24497 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66625 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66542 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66500 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66500 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 532969 ns
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1260875 ns 1396583.5 ns 0.90
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1118417 ns 1097333 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1056541 ns 939062.5 ns 1.13
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2256375 ns 2231792 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 573252 ns 574483.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3028208 ns 2873417 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2726937.5 ns 2715208 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2733875 ns 2626645.5 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3818500 ns 3813542 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1997088 ns 1401203 ns 1.43
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8958062.5 ns 8821895.5 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8813834 ns 8770604 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8742917 ns 8763666.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6350021 ns 6350229.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 82895.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80270.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82875 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80167 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192999 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2045708.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2026499.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2015875 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2005042 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 797613 ns

This comment was automatically generated by workflow using github-action-benchmark.
