docs: update exporting_to_jax.md (#1107)
wsmoses authored Nov 26, 2024
1 parent 05523d6 commit 78ad9c9
Showing 1 changed file with 2 additions and 8 deletions.
10 changes: 2 additions & 8 deletions docs/src/manual/exporting_to_jax.md
@@ -59,7 +59,7 @@ end
 Now we define a python script to run the model using EnzymeJAX.

 ```python
-from enzyme_ad.jax import primitives
+from enzyme_ad.jax import hlo_call

 import jax
 import jax.numpy as jnp
@@ -81,7 +81,7 @@ def run_lux_model(
     weight6_3,
     bias6_3,
 ):
-    return primitives.ffi_call(
+    return hlo_call(
         x,
         weight1,
         bias1,
@@ -93,13 +93,7 @@ def run_lux_model(
         bias6_2,
         weight6_3,
         bias6_3,
-        out_shapes=[
-            jax.core.ShapedArray([4, 10], jnp.float32),
-        ],
-        fn="main",
         source=code,
-        lang=primitives.LANG_MHLO,
-        pipeline_options=primitives.JaXPipeline(""),
     )



1 comment on commit 78ad9c9

@github-actions

Lux Benchmarks

Benchmark suite Current: 78ad9c9 Previous: 161b64c Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4291 ns 4042 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3958 ns 4042 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5125 ns 5000 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4250 ns 3917 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60770 ns 60335 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10250 ns 10292 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10125 ns 9958 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10333 ns 10917 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10334 ns 9917 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 423675 ns 425045 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1125 ns 1250 ns 0.90
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1166 ns 1125 ns 1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1229.5 ns 1417 ns 0.87
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1250 ns 1083 ns 1.15
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 17992 ns 17905 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4250 ns 4083 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4000 ns 4000 ns 1
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4167 ns 4375 ns 0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3958 ns 3916 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 109284 ns 109347 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57417 ns 56292 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38208 ns 46833 ns 0.82
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46375 ns 46229.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80167 ns 81458 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36667.5 ns 36705 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2021709 ns 2055229.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2097000 ns 2092146 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2077875 ns 2088791.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2001000 ns 2005459 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195812 ns 195507 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 145166.5 ns 175854 ns 0.83
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 142666 ns 144666 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 146500 ns 145708 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144167 ns 141167 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165803 ns 165651 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1104750 ns 1150750 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1156062 ns 1127354.5 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1104750 ns 1114250 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1129458 ns 1116458.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 527714 ns 529529 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4000 ns 3208 ns 1.25
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3625 ns 3417 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4375 ns 4208 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3459 ns 3334 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 70555.5 ns 70388 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9084 ns 8875 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8709 ns 9500 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9667 ns 9750 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9167 ns 9250 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 481518.5 ns 494790 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15416 ns 15209 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16958 ns 15000 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16791.5 ns 17209 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14792 ns 14688 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 54315.5 ns 54580 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213958 ns 216291.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 214042 ns 214167 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214208 ns 213416 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214334 ns 225708.5 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 273628 ns 274273 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 709 ns 0.71
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 583 ns 625 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 834 ns 0.80
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583.5 ns 625 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17264 ns 17190 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1500 ns 1709 ns 0.88
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1625 ns 1375 ns 1.18
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1792 ns 1833 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1708 ns 1584 ns 1.08
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 102318 ns 102235 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7000 ns 7125 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5084 ns 5958 ns 0.85
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 5916 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9916 ns 9958 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23961 ns 23722 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221542 ns 222833 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229708.5 ns 227958 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229667 ns 229500 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226542 ns 213417 ns 1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 170388 ns 169452 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 4000 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3916 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3916 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23385 ns 23542 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16625 ns 16834 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16500 ns 16709 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17000 ns 16959 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16833 ns 16666 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 161544 ns 162915 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 581791 ns 571709 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 578709 ns 574917 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 569958 ns 573708 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 572333.5 ns 568500 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113621 ns 113185.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1428958 ns 1427354.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1421292 ns 1431625 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1415833 ns 1423541 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1420000 ns 1422542 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 210533 ns 211963 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1081750 ns 1046896 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 938708 ns 967000 ns 0.97
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1353291.5 ns 1344687.5 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1296666 ns 1304958 ns 0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA 269675 ns 275060 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5971292 ns 5993167 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4530771.5 ns 4544458 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4949917 ns 4946959 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5624041 ns 5568042 ns 1.01
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1072622 ns 1091420 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 583 ns 0.86
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 541 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23468 ns 23913 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2209 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2208 ns 2208 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 169303 ns 169337.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4167 ns 4417 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4208 ns 3833 ns 1.10
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4708 ns 4625 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4125 ns 3958.5 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 66233.5 ns 65443 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11125 ns 11833 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11250 ns 11250 ns 1
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12000 ns 11958 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10792 ns 11125 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 452338 ns 450871 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6292 ns 7208 ns 0.87
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6417 ns 6958 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7604.5 ns 7417 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 6333 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 52542 ns 51992 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18583 ns 18459 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17500 ns 17500 ns 1
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18833 ns 17833 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16833 ns 17459 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 301964.5 ns 300918 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 667 ns 0.81
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32911 ns 32212 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8625 ns 9084 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8542 ns 9437 ns 0.91
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9125 ns 9333 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8917 ns 8958 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 160010 ns 158990.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64500 ns 64208 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64666 ns 64833 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64500 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64500 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112101 ns 111823 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 279458 ns 282708 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 288583 ns 279000 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 273583 ns 273166 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 286083 ns 281437.5 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 185547.5 ns 186218.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3376750.5 ns 3136750 ns 1.08
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 2898291.5 ns 3023208 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3024854 ns 3030188 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3941104 ns 3954583.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 581323 ns 576992 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7603583 ns 7597041.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7358750 ns 7419792 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7466208 ns 7452395.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8146792 ns 8186583.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1318419 ns 1367306 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17484792 ns 17658333 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17670999.5 ns 17553062.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17533250 ns 17551250 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 9220187.5 ns 14310208 ns 0.64
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23603916 ns 23729167 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43639208 ns 33388291 ns 1.31
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37125083 ns 37228104.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34980187.5 ns 34843354 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1854234 ns 1868338 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 188207417 ns 192271333 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 251666438 ns 232983250 ns 1.08
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 194864208 ns 191886562.5 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 434287708 ns 435397084 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13931919 ns 13905970 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 287943833 ns 291433625 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 355406479.5 ns 336814583 ns 1.06
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 297803834 ns 297436208 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 400767145.5 ns 408923438 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22458 ns 22583 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22208 ns 24708 ns 0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25041 ns 23209 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22270.5 ns 21625 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 96107.5 ns 99141.5 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 113166.5 ns 103334 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104292 ns 103750 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 105083 ns 105083 ns 1
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103812.5 ns 103062.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 502678.5 ns 520213.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6833 ns 6000 ns 1.14
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6479.5 ns 5958 ns 1.09
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7041.5 ns 6958 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5958 ns 5708 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68593 ns 69364 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15000 ns 15042 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15479 ns 15209 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16333 ns 16250 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14708.5 ns 15083 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 475032.5 ns 484888 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3031167 ns 3057208.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2061583 ns 2066208 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2253209 ns 2260437.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4505270.5 ns 4508458 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 586394 ns 589772 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23625708.5 ns 23926959 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18333062.5 ns 18026875 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17998916.5 ns 18022708 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35608125.5 ns 35506041.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2764773.5 ns 2765084 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33284000 ns 33917958 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28078500 ns 27599646 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28952938 ns 28534208 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41446187.5 ns 41643583.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72167 ns 74541.5 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 81083 ns 74313 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 86562.5 ns 74500 ns 1.16
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 75479 ns 72291 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 104806 ns 104269 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223458.5 ns 317750 ns 0.70
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 325166 ns 208562.5 ns 1.56
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 320958 ns 322375 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 210500 ns 291583.5 ns 0.72
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 552193 ns 562266.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11917 ns 11875 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12583 ns 11625 ns 1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12708 ns 13250 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12083 ns 12125 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 71752 ns 72944 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26667 ns 27208 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26583 ns 26791.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28000 ns 27833.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26500 ns 26750 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 476956.5 ns 485353 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11667 ns 13458.5 ns 0.87
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12333 ns 12375 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12917 ns 13250 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11834 ns 12291 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 53475 ns 54559.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25792 ns 26417 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25500 ns 25959 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26500 ns 26209 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26000 ns 26000 ns 1
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 305905.5 ns 311166.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 181458 ns 181458 ns 1
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 180541 ns 179708 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184604.5 ns 183437.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179667 ns 181354 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 57257.5 ns 58673.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 592917 ns 597521 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 587687.5 ns 584083 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 595750 ns 583958.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582791.5 ns 582625 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 291107 ns 295518 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8958 ns 6125 ns 1.46
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6583 ns 5958 ns 1.10
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8042 ns 7333 ns 1.10
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6375 ns 6166.5 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 71199.5 ns 71636.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13916 ns 15312.5 ns 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14875 ns 14333 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15459 ns 15708 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13958.5 ns 13958 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 465947 ns 473061 ns 0.98
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1219708 ns 1205708 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1231750 ns 1241125 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1269667 ns 1286479 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1009666 ns 1000208 ns 1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA 300921 ns 301351 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4103750 ns 4319770.5 ns 0.95
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4571833 ns 4471334 ns 1.02
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4574959 ns 4578416 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3707208 ns 3698417 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1038858 ns 1037486.5 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1834 ns 1916 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1917 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23656 ns 24166 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4875 ns 4917 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4792 ns 4834 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4917 ns 5083 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 190147.5 ns 194650 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5375 ns 6583 ns 0.82
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5708.5 ns 6208 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6917 ns 7125 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5437.5 ns 5750 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 56411.5 ns 56615.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10750 ns 12209 ns 0.88
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11000 ns 10895.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11834 ns 11667 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10729.5 ns 11000 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 336162 ns 336343 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 375 ns 292 ns 1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 334 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22819 ns 23536 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2750 ns 3042 ns 0.90
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2750 ns 2791 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3042 ns 3042 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2792 ns 2750 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 159135.5 ns 163558.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11458 ns 12083 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11333 ns 11375 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12750 ns 14667 ns 0.87
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11208 ns 11500 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 58102 ns 58066.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24750 ns 25541 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24334 ns 24250 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25084 ns 25125 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24750 ns 24458 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 298883.5 ns 299332 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4209 ns 4208 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4209 ns 4167 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4291 ns 4209 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24823 ns 25749 ns 0.96
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16084 ns 16125 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 15959 ns 16166 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16500 ns 16291 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16167 ns 16250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 197271 ns 200089.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 5916 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5791 ns 5750 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5916 ns 5959 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 5833 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34115 ns 34238 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20500 ns 21125 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20417 ns 20459 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21250 ns 21167 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20708 ns 20812.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 178582.5 ns 179917 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 423708.5 ns 397270.5 ns 1.07
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 366416.5 ns 384187.5 ns 0.95
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 484917 ns 478583.5 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 103541 ns 103333 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67022 ns 67557 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 943375 ns 891750 ns 1.06
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 950687 ns 972959 ns 0.98
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1197916.5 ns 1184041.5 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 330416.5 ns 330499.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 193979 ns 194177 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80541.5 ns 79812.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81125 ns 81209 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 81541.5 ns 84042 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80479.5 ns 79916.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194031 ns 194547.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1919833 ns 1931812.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1936958 ns 1636646 ns 1.18
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1930229 ns 1918646 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1923250 ns 1926062 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 400084 ns 403673 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21834 ns 22738 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1750 ns 1792 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 168563 ns 173787 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6416 ns 7041 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6166 ns 6666 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7667 ns 7666 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6709 ns 6666 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 61087.5 ns 61338.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8959 ns 9459 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8875 ns 9208 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9250 ns 9333 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9312.5 ns 9500 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 309875.5 ns 310208.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 118672458 ns 155906937.5 ns 0.76
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 182326458 ns 174332958 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148081791.5 ns 147872625 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 102035042 ns 105277000 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5467326.5 ns 5483548 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 610447729.5 ns 669282000 ns 0.91
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 582022188 ns 555382333 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 452913708.5 ns 453291791.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 751418979 ns 761771979 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34971564 ns 35124637 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 646694167 ns 699486584 ns 0.92
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 688250333 ns 668241854.5 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 583281666.5 ns 612942458.5 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 744581417 ns 744149959 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59000 ns 56292 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 37792 ns 47709 ns 0.79
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47750 ns 47584 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83417 ns 83167 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38231 ns 37949 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1925854 ns 1925646.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1987562.5 ns 1981729 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1779021 ns 1969458.5 ns 0.90
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1864125 ns 1898709 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 175192.5 ns 177394.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 292250 ns 269708.5 ns 1.08
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 268916 ns 268625 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 269500 ns 287000 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 266000 ns 267041 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 128884 ns 125253 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 686771 ns 681916.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 702187.5 ns 693791 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 591083 ns 682500 ns 0.87
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 688958 ns 685958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 706872 ns 675851.5 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2268958 ns 2214458 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2245875 ns 2234583 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2101125 ns 2206291 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2176375 ns 2191792 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133295.5 ns 134149 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5521229.5 ns 5560291 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5587167 ns 5498375 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5520666.5 ns 5509646 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5493834 ns 5459417 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 748599 ns 719852 ns 1.04
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 642084 ns 658625 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 648917 ns 643125 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 636667 ns 639042 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 635875 ns 637666 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46696 ns 47328 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1822625 ns 1793125 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1670333 ns 1725000 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1719875 ns 1723687.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2097416.5 ns 2098895.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 221082 ns 225375 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57833 ns 56875 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38500 ns 47416 ns 0.81
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46250 ns 47125 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82750 ns 83625 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28653 ns 29103 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020167 ns 2036833 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2105417 ns 2094667 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2093958 ns 2075625 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1999958.5 ns 2003333 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 190261 ns 192100 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13356563 ns 13402000 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12441584 ns 12431542 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12535208 ns 12506125 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15154375 ns 14837542 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 512188.5 ns 516101 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47248458 ns 47711000 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 42098688 ns 42011395.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40986395.5 ns 40917708 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58394208 ns 58129729.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2891115 ns 2890593.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74033603.5 ns 97106625 ns 0.76
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 68368417 ns 68523125 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90690875 ns 90562125 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76143146 ns 76819625 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58250 ns 57208 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38583 ns 47750 ns 0.81
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47625 ns 47250 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79125 ns 82041 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47024 ns 47330 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1918250 ns 1935667 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1983396 ns 1983791 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1965584 ns 1973041.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1830750 ns 1878417 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 192100.5 ns 195219 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 417 ns 0.70
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 334 ns 334 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32257 ns 32542 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6083 ns 6750 ns 0.90
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6000 ns 6125 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6416 ns 6625 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6104.5 ns 6375 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 172267 ns 170591.5 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 334 ns 0.75
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31372 ns 32528 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2625 ns 2958 ns 0.89
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2625 ns 2667 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2875 ns 2917 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2666 ns 2625 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 158332 ns 158771.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 283213208 ns 321043354.5 ns 0.88
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 347751604 ns 340532834 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314361479.5 ns 314151312.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 273430250 ns 270601541 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7090888 ns 7107105.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 992205416 ns 1046677708.5 ns 0.95
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 964468250 ns 945289167 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 838327667 ns 840954313 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1152689375 ns 1155312792 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34106482 ns 34104665 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1303968312.5 ns 1718615541 ns 0.76
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1327504666.5 ns 1335253333.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1629886334 ns 1620256500 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1314925417 ns 1333409458.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1455709 ns 1460479.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1463125 ns 1422584 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1415166.5 ns 1418083.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1410000 ns 1412208.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127607 ns 127814.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5015979 ns 5051916 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5060792 ns 5033458.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5051500 ns 5025999.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5009458 ns 5025125 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 574399.5 ns 500081 ns 1.15
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 170351312 ns 162840083 ns 1.05
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 167663375 ns 128019708.5 ns 1.31
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 130848583.5 ns 130269666 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 167905166.5 ns 152687687.5 ns 1.10
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4881672 ns 4884899 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 618588292 ns 844540708 ns 0.73
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 577882000 ns 537349833 ns 1.08
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 497505667 ns 560583292 ns 0.89
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 647917125 ns 649437458 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16266169 ns 17863022 ns 0.91
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8910542 ns 9095833.5 ns 0.98
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 9026291.5 ns 8979250 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7927084 ns 7868500 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9711125 ns 9713958 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1592738 ns 1593097 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 35730646 ns 37599479 ns 0.95
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 38522375 ns 37114520.5 ns 1.04
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33553041 ns 33537625 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 37755625 ns 37598895.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6512589 ns 6454775 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47333 ns 47417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47333 ns 47500 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47334 ns 47666 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47875 ns 47375 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18035 ns 18487 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 52792 ns 50417 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50292 ns 50416 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50458 ns 50500 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50667 ns 50333.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 197012 ns 161534 ns 1.22
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6375 ns 7854.5 ns 0.81
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6250 ns 6770.5 ns 0.92
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7417 ns 7729.5 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6750 ns 7083 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 112280 ns 73765 ns 1.52
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9584 ns 10209 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9458 ns 10375 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10125 ns 10042 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10209 ns 10000 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 615930.5 ns 437389 ns 1.41
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5416 ns 6875 ns 0.79
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5791 ns 6458 ns 0.90
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7146 ns 8250 ns 0.87
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5959 ns 5583.5 ns 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 123840 ns 81756 ns 1.51
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12583 ns 13791 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12750 ns 13500 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13208 ns 13541 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12708 ns 12895.5 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 529723.5 ns 408231 ns 1.30
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1125 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1000 ns 1000 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32491 ns 32689 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8000 ns 8500 ns 0.94
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7750 ns 7667 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8209 ns 8000 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7959 ns 8042 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 209838 ns 192936.5 ns 1.09
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23417 ns 23459 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23041 ns 23208 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23584 ns 23375 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23417 ns 23542 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18029 ns 18259 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 54667 ns 52750 ns 1.04
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52417 ns 52625 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52667 ns 52791.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52458 ns 52292 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 299710 ns 228166 ns 1.31
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1444833 ns 1407187.5 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1449584 ns 1444833 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1399209 ns 1405083 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1396958.5 ns 1396895.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195765 ns 196465 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5000042 ns 5040708 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5049833 ns 5018541 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5044562 ns 5002417 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5015291.5 ns 5013625 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 612366.5 ns 546168 ns 1.12
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3043104 ns 3079083 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2098583 ns 2047000 ns 1.03
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2313209 ns 2294458.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4606709 ns 4540917 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 580804.5 ns 582581 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24374458 ns 24731020.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19110937.5 ns 18912562.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18926833 ns 19038249.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36250750 ns 36828979.5 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2861963.5 ns 2836262.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33972875 ns 34546958.5 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28642167 ns 28342834 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28092229 ns 28021500.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41633541.5 ns 41446459 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 141888875 ns 144151542 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 146034209 ns 148019541 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126705062.5 ns 125949729 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173781771 ns 173005021 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22552094 ns 22565027 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1227732750 ns 948587416.5 ns 1.29
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 839227916.5 ns 1316893208.5 ns 0.64
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 739276458 ns 846166625 ns 0.87
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 683957250 ns 681952500 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 117875105 ns 118678990 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 73084 ns 76499.5 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74479 ns 80646 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75750 ns 75541.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74958 ns 72583 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 240665.5 ns 219501.5 ns 1.10
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 280208.5 ns 295875 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 288959 ns 203584 ns 1.42
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 193791 ns 292875 ns 0.66
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 192583 ns 288125 ns 0.67
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1331151 ns 1030687 ns 1.29
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35557542 ns 36242145.5 ns 0.98
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36592625 ns 36566979.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32410750 ns 32367458.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40376458 ns 40164416.5 ns 1.01
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5838475 ns 5846818 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148073500 ns 152632458 ns 0.97
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 158619999.5 ns 152676896 ns 1.04
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 139542333.5 ns 139286062.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 282659625 ns 283773000 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34873454 ns 34916870 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120976041.5 ns 156722375.5 ns 0.77
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 182674416.5 ns 173916792 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147566209 ns 148066500 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 105641958.5 ns 102175416 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5456587 ns 5486669 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 471084687.5 ns 519305021 ns 0.91
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 489605103.5 ns 467283583 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 432706750 ns 441689083 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 737367000 ns 742430042 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32284178 ns 32276395 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 707739104.5 ns 688401084 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 677702687.5 ns 657912104.5 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 572041062.5 ns 573100917 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 735458208 ns 731550292 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1303791.5 ns 1195458.5 ns 1.09
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 778750 ns 988250 ns 0.79
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 904854 ns 987583 ns 0.92
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1945625 ns 2066875 ns 0.94
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 581135.5 ns 585359 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2961271 ns 2919770.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2515584 ns 2614875 ns 0.96
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2624334 ns 2611792 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3695417 ns 3691417 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1838423 ns 1640515 ns 1.12
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5788229.5 ns 5907500 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5903625 ns 5785541 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5805354.5 ns 5799666 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2899667 ns 2887792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7167 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5250 ns 6125 ns 0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6167 ns 6167 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9916 ns 10000 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25653 ns 25666.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212479.5 ns 213834 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226833 ns 221083 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220417 ns 220958 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206167 ns 209500 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 275653 ns 224627 ns 1.23
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 307447667 ns 310292438 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 279760625 ns 228430666 ns 1.22
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 198268687.5 ns 199615625 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 308090500 ns 310121208 ns 0.99
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7673335 ns 7680035.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1074946146 ns 1101205687.5 ns 0.98
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1069981500 ns 904614354 ns 1.18
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 801953875 ns 806439375 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1147606167 ns 1160007708.5 ns 0.99
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26674789 ns 26999631 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4958 ns 6458 ns 0.77
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5208 ns 6041 ns 0.86
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5958 ns 6083 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5042 ns 4895.5 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 169081.5 ns 119636.5 ns 1.41
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6833 ns 7833 ns 0.87
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6917 ns 7292 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7500 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7125 ns 7166 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 666084 ns 510149.5 ns 1.31
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 666 ns 0.88
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 708 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24582 ns 24235 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9125 ns 9625 ns 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 8459 ns 9583 ns 0.88
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9084 ns 9291 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9041 ns 9041 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 231180 ns 191615 ns 1.21
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 352416.5 ns 352542 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351792 ns 351541.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 354500 ns 354208 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 352125 ns 352041 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21300.5 ns 21082 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 814416 ns 827625 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 809021 ns 774417 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 782042 ns 830187 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 827334 ns 822209 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 305499.5 ns 224458.5 ns 1.36
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 336479.5 ns 315667 ns 1.07
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 321125 ns 337708 ns 0.95
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 450500 ns 448542 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 10542 ns 11375 ns 0.93
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18195 ns 18423 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 721208 ns 705604.5 ns 1.02
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 733229 ns 738958.5 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1007271 ns 999000 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26666 ns 26459 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 274145 ns 211965.5 ns 1.29
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 383062 ns 360167 ns 1.06
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 329312 ns 346666 ns 0.95
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 442417 ns 437417 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 30792 ns 30125 ns 1.02
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22813 ns 22977 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 737625 ns 727167 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 785604 ns 782250 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1032042 ns 1026667 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 105375 ns 90000 ns 1.17
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 222871.5 ns 196309.5 ns 1.14
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3708 ns 3625 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3417 ns 3541 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3666 ns 3625 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3583 ns 3458 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17737 ns 18016 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4417 ns 4541 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4209 ns 4375 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4333 ns 4292 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4292 ns 4250 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 278790 ns 210663.5 ns 1.32
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3791 ns 3750 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3604.5 ns 3625 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4145.5 ns 4500 ns 0.92
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3666.5 ns 3708 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 207112 ns 158953 ns 1.30
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 8750 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8000 ns 8167 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 8875 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8458 ns 8375 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1220818 ns 976072 ns 1.25
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203687.5 ns 203542 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210041 ns 211375 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210625 ns 212125 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200708 ns 200042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34937 ns 35273 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 645270.5 ns 649750 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 631770.5 ns 622083 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 622458 ns 673000 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630750 ns 628584 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 343085 ns 286304.5 ns 1.20
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 1001750 ns 1006541.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1034729 ns 1012562.5 ns 1.02
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 956333 ns 950084 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 879958 ns 867374.5 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207672.5 ns 208692 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4524208 ns 4662333 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4821708 ns 4724042 ns 1.02
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4482250 ns 4460291 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 5132979 ns 5133479.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 922465 ns 931046 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3666 ns 3750 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3292 ns 3416 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 3417 ns 3875 ns 0.88
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3583 ns 3000 ns 1.19
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 232276 ns 160179 ns 1.45
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7292 ns 7708 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6792 ns 7000 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7500 ns 1
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6875 ns 6917 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1014308 ns 834512 ns 1.22
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1651708 ns 1638021 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1164875 ns 1178750.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1344708 ns 1368583 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2500875 ns 2435458 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214937 ns 212757 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12379084 ns 12417125 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9615125.5 ns 9573771 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9247041 ns 9272896 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18054792 ns 18032250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1946109 ns 1947684.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17413000 ns 17407875.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14415146.5 ns 14413792 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14339250 ns 14355521 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21151646 ns 21131291.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 134917 ns 89791 ns 1.50
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 88958 ns 90333 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 91334 ns 91667 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 87666 ns 88604 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126488 ns 125843 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2026792 ns 2042500 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2043625 ns 2024209 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1766792 ns 2017334 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026459 ns 2030458 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1034650 ns 851622 ns 1.21
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 2770.5 ns 1500 ns 1.85
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 1334 ns 2250 ns 0.59
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3208 ns 3833 ns 0.84
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 3791 ns 2250 ns 1.68
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16389 ns 15376 ns 1.07
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2584 ns 2916 ns 0.89
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2459 ns 2459 ns 1
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2709 ns 2791 ns 0.97
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2791 ns 2917 ns 0.96
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 192723.5 ns 153882 ns 1.25
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7250 ns 1
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5208 ns 6000 ns 0.87
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5959 ns 6000 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9959 ns 10000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34193 ns 33856.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225250 ns 221791 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227063 ns 220646 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220708 ns 220479.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213333 ns 241958.5 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 312634.5 ns 266253.5 ns 1.17
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3750 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3708 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22321 ns 22475 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14417 ns 14209 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14250 ns 14333 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14416 ns 14459 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14375 ns 14417 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 475484 ns 372905 ns 1.28
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 134292 ns 96208 ns 1.40
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 93667 ns 95604 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 94354.5 ns 96583.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 91958 ns 91812.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125921 ns 125359 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1924541.5 ns 1942458 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1939333 ns 1923146 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1709625 ns 1909167 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1925042 ns 1932625 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 949226.5 ns 780596.5 ns 1.22
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 874708 ns 859584 ns 1.02
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 796250 ns 815917 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1220958 ns 1209375 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 963208 ns 960270.5 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 277966 ns 271785 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2838542 ns 2844229 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2538917 ns 2490542 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3341125 ns 3348000.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3415500 ns 3404749.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1590492.5 ns 1487247 ns 1.07
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17646 ns 17416.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16500 ns 17416 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18042 ns 18333 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17333 ns 17604.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 142389.5 ns 140524.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 226250 ns 261583 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 239208.5 ns 215667 ns 1.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215666.5 ns 257416.5 ns 0.84
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 227708 ns 215792 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 648593.5 ns 572039 ns 1.13
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222666 ns 222625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 220083 ns 222062.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222792 ns 222146 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 221875 ns 220833 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 275688.5 ns 232890 ns 1.18
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 564542 ns 507750 ns 1.11
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 507292 ns 501667 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 506333 ns 556500 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 559542 ns 507875 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1323540.5 ns 1207816 ns 1.10
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 4229.5 ns 4292 ns 0.99
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 3958 ns 4042 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 3916 ns 4541.5 ns 0.86
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 4333 ns 3875 ns 1.12
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16749 ns 16753 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7187 ns 7625 ns 0.94
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 6917 ns 7125 ns 0.97
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7292 ns 7167 ns 1.02
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7416 ns 7270.5 ns 1.02
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 193558 ns 176857.5 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19333.5 ns 18542 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17167 ns 16958 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19291 ns 19792 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16959 ns 16562.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 145420.5 ns 145193.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223917 ns 224708 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216437.5 ns 211854 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215375 ns 238145.5 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213812.5 ns 212042 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 914033 ns 888620 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4958 ns 4917 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4250 ns 4208 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4417 ns 5042 ns 0.88
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3917 ns 3667 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 206416 ns 184696.5 ns 1.12
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10250 ns 10875 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10000 ns 10584 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10958 ns 11042 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10000 ns 10458 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1027488.5 ns 966049 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3833 ns 3645.5 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3459 ns 3209 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 3416 ns 3792 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3250 ns 2833 ns 1.15
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 236791.5 ns 188943.5 ns 1.25
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 8000 ns 0.93
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7250 ns 7125 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7958 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7375 ns 7208 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1067899 ns 1007792.5 ns 1.06
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23463750.5 ns 24183291.5 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43484791.5 ns 34946479 ns 1.24
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37835875 ns 37338083 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34880875 ns 34888125 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1833754 ns 1782868.5 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184463792 ns 186454375 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 172964124.5 ns 159896583 ns 1.08
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146554521 ns 145990104.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 410369375 ns 411376042 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16525549 ns 16457564 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 424815979 ns 432652834 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 259769792 ns 247809833 ns 1.05
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 297288958 ns 279749334 ns 1.06
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 478383791 ns 479974375 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 183959 ns 183958.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 183375 ns 182375 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 186187.5 ns 185500 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 183187.5 ns 184062.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 205888.5 ns 172057.5 ns 1.20
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 602916.5 ns 637709 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 596416.5 ns 586041.5 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 592375 ns 639084 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 596542 ns 596416 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1054788 ns 1002959 ns 1.05
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3829562.5 ns 4026750 ns 0.95
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3998791.5 ns 3920250 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3564812.5 ns 3579209 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4550791.5 ns 4570291.5 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 532059.5 ns 532647 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17302667 ns 17895041.5 ns 0.97
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 18565313 ns 17836083 ns 1.04
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16600312.5 ns 16489292 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 20208979.5 ns 20147270.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2631431 ns 2607011.5 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 33095 ns 32522 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9083 ns 9729 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9042 ns 9291 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9458.5 ns 9625 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9125 ns 9209 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 266296 ns 258262.5 ns 1.03
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 498097750 ns 503041917 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 506743916 ns 424847437.5 ns 1.19
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 424015542 ns 425274250 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 594637416 ns 682175395.5 ns 0.87
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12483759 ns 12478951 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1878936437.5 ns 1889075833 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1662067875 ns 1625727875 ns 1.02
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1496755770.5 ns 1494457604.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2214230167 ns 2214128083.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49527395 ns 49385566.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1663166 ns 1647625 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1177833 ns 1201312.5 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1370041 ns 1376271 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2349521 ns 2354042 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 217522 ns 214603 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12726750 ns 12810437.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 10036417 ns 9968417 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9643083 ns 9702395.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18397833 ns 18320249.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2037123 ns 2015837.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17723584 ns 17772083 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14827916 ns 14741771 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14555416.5 ns 14583292 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21415041 ns 21392208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26250 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26250 ns 26250 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26291 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26209 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23706 ns 23824 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67354.5 ns 67125 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66792 ns 67500 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 68375 ns 67958 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66875 ns 67000 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 393355.5 ns 377030.5 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203458 ns 204125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209417 ns 209792 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210084 ns 210750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199125 ns 200292 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26245.5 ns 26462 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 647916 ns 650250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 672375.5 ns 625708.5 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 621792 ns 669874.5 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 593542 ns 629250 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 351878.5 ns 303651 ns 1.16
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 679750 ns 627292 ns 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 657291 ns 671583 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 595709 ns 598312 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 632771 ns 639791 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131601.5 ns 132031 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2238750 ns 2336625 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2300791 ns 2255375 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2241896 ns 2235562.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2244958 ns 2236583 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1242570.5 ns 1129126 ns 1.10
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18625 ns 18437 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17979 ns 18354.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18375 ns 20062.5 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17104 ns 17104.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144244 ns 144037 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 256458 ns 265000 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 245646 ns 230729 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221750 ns 231875 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 230416 ns 258875 ns 0.89
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1056298 ns 929000 ns 1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 708 ns 0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 666 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23741 ns 23483 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9208 ns 10125 ns 0.91
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9708 ns 9541 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9458 ns 10208 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9333 ns 9291 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 257592.5 ns 253535 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5125 ns 5958 ns 0.86
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5500 ns 5417 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6395.5 ns 6166 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5458 ns 4959 ns 1.10
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 231821.5 ns 177041 ns 1.31
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6833 ns 7770.5 ns 0.88
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6792 ns 7250 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7458 ns 7875 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6917 ns 6917 ns 1
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 801589.5 ns 724899.5 ns 1.11
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2167 ns 2334 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2000 ns 2042 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2208 ns 2208 ns 1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2375 ns 2292 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 17797 ns 17786 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6375 ns 6667 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6542 ns 6958 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6667 ns 6625 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6375 ns 6500 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 330267.5 ns 316064.5 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 748708 ns 752459 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 756208 ns 746750 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 752750 ns 750791 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 753542 ns 746917 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 20724 ns 21186 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 792417 ns 794041.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 796875 ns 787583 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 786834 ns 810166.5 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 808000 ns 777749.5 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 297689.5 ns 292715.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7125 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5250 ns 6000 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 6042 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10125 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33074 ns 33031.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228604.5 ns 260583 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 251041 ns 266771 ns 0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 227708 ns 240125 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226000 ns 213791 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 362298.5 ns 347920 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10209 ns 10417 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10209 ns 10083 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10458 ns 10666 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9750 ns 9770.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 252317 ns 236152.5 ns 1.07
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25334 ns 24958 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24312.5 ns 24125 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25959 ns 25625 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24395.5 ns 24625 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1133104 ns 1075060 ns 1.05
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106928354 ns 106687708 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 126898666 ns 118577083.5 ns 1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 121692334 ns 120497312.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117598792 ns 118064771 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2629460 ns 2612121 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 390743083 ns 394040917 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 379904750 ns 367160584 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 361277959 ns 357048666 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 481946125 ns 483172291 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15184946 ns 15226002.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 754771020.5 ns 944093812.5 ns 0.80
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 597861750 ns 581088583 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 748681771 ns 744439291.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 760209125 ns 770449312.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6500 ns 7833.5 ns 0.83
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6667 ns 6584 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8333 ns 7667 ns 1.09
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6667 ns 6584 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 239111 ns 231298 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14125 ns 14833.5 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14125 ns 13833 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14437.5 ns 14333 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13667 ns 13667 ns 1
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1073718 ns 1030746 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5542 ns 6812.5 ns 0.81
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5542 ns 6250 ns 0.89
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6395.5 ns 8625 ns 0.74
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 5458 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 235877.5 ns 228035.5 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12208 ns 13541 ns 0.90
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12542 ns 12250 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12750 ns 13417 ns 0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12166 ns 12375 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 781667 ns 749909.5 ns 1.04
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5709 ns 5562.5 ns 1.03
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 5437.5 ns 5917 ns 0.92
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 5750 ns 5959 ns 0.96
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5833 ns 5625 ns 1.04
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16760 ns 17374 ns 0.96
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15417 ns 15979.5 ns 0.96
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15333 ns 15291 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15500 ns 15666.5 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15625 ns 15791 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 199275.5 ns 198865.5 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns 417 ns 0.70
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 333 ns 334 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23515 ns 23594 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6333 ns 6770.5 ns 0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6167 ns 6416 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6417 ns 6875 ns 0.93
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6333 ns 6416 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 240257 ns 238325.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 5958 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5917 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6083 ns 5958 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5875 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24789 ns 24848 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 20958 ns 22291.5 ns 0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 20958.5 ns 21375 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21334 ns 21750 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21000 ns 21833 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 263523 ns 262151 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 188417 ns 145041 ns 1.30
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 162166 ns 179792 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 146708.5 ns 147000 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 149625 ns 145833 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167166 ns 167939 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1323812.5 ns 1367292 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1371958 ns 1334375 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1317937.5 ns 1330499.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1325562.5 ns 1319209 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1350174 ns 1299116 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25292 ns 23000 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22500 ns 24062.5 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23146.5 ns 24875 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22979.5 ns 21542 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 352259 ns 285873 ns 1.23
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 173645.5 ns 181458 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 180041 ns 142020.5 ns 1.27
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 119500 ns 130312 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 126334 ns 166291 ns 0.76
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1470411 ns 1432985 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23380 ns 24013 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6125 ns 6667 ns 0.92
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6229.5 ns 6292 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6708 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6167 ns 6208 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 256300 ns 257668.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5084 ns 4875 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5083 ns 4541 ns 1.12
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5083 ns 4917 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4292 ns 4334 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 256465.5 ns 248170.5 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10209 ns 10541 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9750 ns 9500 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10750 ns 10583 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10208 ns 10000 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1354750 ns 1315251.5 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1667 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1583 ns 1584 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1708 ns 1625 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22916 ns 23770.5 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5750 ns 6083 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5667 ns 5625 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6167 ns 6000 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5750 ns 5625 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 272343 ns 277301 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6820375 ns 6853687 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6368417 ns 6416292 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6567000 ns 6504750 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7648166 ns 7620312.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214879 ns 214867.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24083333.5 ns 24153125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21351687.5 ns 21320542 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21140875 ns 21047708.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29752125.5 ns 29760542 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2100360 ns 2095640.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37299645.5 ns 48863062.5 ns 0.76
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 34217771 ns 34327709 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45700125 ns 45697437.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 38021000 ns 38239917 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5750 ns 6708 ns 0.86
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5583.5 ns 5666 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6395.5 ns 6459 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5292 ns 5770.5 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 235350 ns 232386 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8167 ns 9062.5 ns 0.90
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8416.5 ns 8375 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 8375 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8291 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1060836 ns 1027676.5 ns 1.03
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1566292 ns 1539229.5 ns 1.02
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1237250 ns 1264500 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1619208 ns 1616916 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2132958 ns 2152625 ns 0.99
lenet(28, 28, 1, 128)/forward/GPU/CUDA 278998 ns 281859 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7937625 ns 7990000 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6656917 ns 6612375 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7130604.5 ns 7167458 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10453333.5 ns 10472916.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1878437 ns 1870517 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 370292 ns 359666 ns 1.03
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 353124.5 ns 372896 ns 0.95
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 459083 ns 456458 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 23666 ns 22396 ns 1.06
batchedmm(128, Bsize=4)/forward/GPU/CUDA 42541.5 ns 47625 ns 0.89
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 753083 ns 739666 ns 1.02
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 809125 ns 822937.5 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1063125 ns 1053333 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 116979.5 ns 109291 ns 1.07
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 239130.5 ns 240230 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397291 ns 396792 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 212417 ns 288042 ns 0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288125 ns 287917 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 752000 ns 755250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44180 ns 45350 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 667583 ns 639083 ns 1.04
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 474167 ns 531000 ns 0.89
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 531812.5 ns 531625 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973083 ns 973083 ns 1
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 194058 ns 194303 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 678250 ns 636645.5 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 667145.5 ns 636021 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 621709 ns 652063 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 646959 ns 654042 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133035 ns 133147 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2484229 ns 2499458 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2543916.5 ns 2456708 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2480312.5 ns 2459542 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2471875 ns 2452854 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1215811 ns 1214588 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 2791 ns 2209 ns 1.26
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 2084 ns 3041 ns 0.69
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4333 ns 4667 ns 0.93
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 3354 ns 2792 ns 1.20
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16281.5 ns 16731 ns 0.97
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5375 ns 5625 ns 0.96
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5209 ns 5333 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5500 ns 5625 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5584 ns 5584 ns 1
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 201076.5 ns 199833.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1457583 ns 1461916.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1497084 ns 1505708 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1498833 ns 1503458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1436500 ns 1437083 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 41204 ns 41276 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5117834 ns 5154479 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5304542 ns 5307146 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5300500 ns 5288209 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4807333 ns 5001917 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 199725 ns 200453 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 32858 ns 34571 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15250 ns 15250 ns 1
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15000 ns 15250 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15292 ns 15375 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15083 ns 15125 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 377713 ns 372573 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 70792 ns 71375 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71417 ns 71583 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71125 ns 71208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 70000 ns 71250 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113374.5 ns 114012 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 318333 ns 325917 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 334916 ns 325167 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 318083 ns 318375 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 318209 ns 317750 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 193117.5 ns 199225 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 1083 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1000 ns 1000 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 959 ns 1000 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23866.5 ns 24050 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7833 ns 8584 ns 0.91
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7875 ns 8084 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 8375 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7875 ns 8000 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 261797 ns 262017.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 512646 ns 497584 ns 1.03
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 479541 ns 490208.5 ns 0.98
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 566104 ns 559959 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 216667 ns 148209 ns 1.46
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130101 ns 129838.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1405541 ns 1405375 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1481750 ns 1471875 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1758666 ns 1758791.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 872625 ns 869583 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 274250.5 ns 274551 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 333 ns 334 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31596 ns 32490 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6875 ns 0.93
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 5854.5 ns 6208 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6500 ns 6541 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6042 ns 6333 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 263141.5 ns 265808 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1731916.5 ns 1723271 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1768000 ns 1751146 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1725583 ns 1734270.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1724459 ns 1724083 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168363 ns 169537.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4401542 ns 4419270.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4406313 ns 4365292 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4361083 ns 4351792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4360083 ns 4357792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1173884.5 ns 1171701 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6583 ns 6833.5 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6791 ns 7062.5 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7062.5 ns 7833 ns 0.90
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6791 ns 6833 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20597 ns 20938 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 32792 ns 72249.5 ns 0.45
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 62083 ns 51291.5 ns 1.21
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33292 ns 52875 ns 0.63
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 51084 ns 51333 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 293465.5 ns 211685.5 ns 1.39
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 18000 ns 17709 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 17458 ns 18333 ns 0.95
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 17916 ns 18312.5 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 18042 ns 17625 ns 1.02
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18220 ns 18852 ns 0.97
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53250 ns 53583 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53292 ns 52958 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53583 ns 53625 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53416.5 ns 53417 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 340467.5 ns 337333.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75333 ns 75417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75417 ns 75375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75292 ns 75334 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 74833 ns 75333 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46370 ns 47609 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324292 ns 339875 ns 0.95
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 342291.5 ns 332958 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 336708 ns 325791 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 324667 ns 324042 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 208689 ns 215842 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1483500 ns 1486125 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1520542 ns 1530958 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1528333 ns 1527584 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1461958 ns 1463416 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51330 ns 52815 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5116916.5 ns 5149209 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5306417 ns 5312291.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4956417 ns 5298250 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4985125.5 ns 4995000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 204511 ns 207728 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28292 ns 28291 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28167 ns 28292 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24159 ns 24971.5 ns 0.97
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66584 ns 66292 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66208 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67583 ns 66209 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66208 ns 66500 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 518001 ns 510271 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1500667 ns 1349333.5 ns 1.11
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 935916 ns 1135833 ns 0.82
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1063395.5 ns 1132458 ns 0.94
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2253583 ns 2196062.5 ns 1.03
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 585024 ns 589889 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3089125 ns 3042333 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2661333 ns 2731792 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2581104 ns 2726167 ns 0.95
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3818625 ns 3811625 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1992242 ns 2004374 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7906625 ns 8038292 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8031000 ns 7942499.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7927541.5 ns 7931979.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4820333 ns 4817250 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 134041 ns 80499.5 ns 1.67
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81459 ns 82250 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82833 ns 82500 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81833 ns 80479.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194356 ns 194209 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2010167 ns 2050042 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2043167 ns 2034333.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2009750 ns 2017875 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026792 ns 2018854 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 794414 ns 768336 ns 1.03

This comment was automatically generated by a workflow using github-action-benchmark.
