fix: update default rng for reactant #1152

Merged
merged 2 commits into from
Jan 1, 2025

Conversation

avik-pal
Member

@avik-pal avik-pal commented Jan 1, 2025

Contributor

@github-actions github-actions bot left a comment

Lux Benchmarks
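
Each row of the table below reads as benchmark name, current time, previous time, and ratio, where the ratio appears to be current divided by previous (for example, 4166 / 4083.5 ≈ 1.02), so values above 1 mean the current commit is slower. A minimal sketch of how such a row could be parsed and checked follows. The names `parse_row` and `flag_regressions` and the `1.10` threshold are hypothetical, chosen for illustration only; they are not part of Lux or of the benchmark action.

```python
import re

# Hypothetical pattern for one flattened row:
# "<name> <current> ns <previous> ns <ratio>"
ROW = re.compile(
    r"^(?P<name>.+?) (?P<current>[\d.]+) ns (?P<previous>[\d.]+) ns (?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Return (name, current_ns, previous_ns, recomputed_ratio) for one row."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised row: {line!r}")
    current = float(m["current"])
    previous = float(m["previous"])
    # Recompute the ratio rather than trusting the printed value.
    return m["name"], current, previous, round(current / previous, 2)


def flag_regressions(lines, threshold=1.10):
    """Return (name, ratio) pairs whose ratio exceeds the assumed threshold."""
    flagged = []
    for line in lines:
        name, _, _, ratio = parse_row(line)
        if ratio > threshold:
            flagged.append((name, ratio))
    return flagged


rows = [
    "layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4166 ns 4083.5 ns 1.02",
    "layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 188166.5 ns 143854.5 ns 1.31",
]
for name, ratio in flag_regressions(rows):
    print(f"{ratio:.2f}  {name}")
```

Many of the timings here are in the nanosecond-to-microsecond range, so a single row with a high ratio may reflect measurement noise rather than a real regression; the threshold in the sketch is only a starting point for manual inspection.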

Benchmark suite Current: 5db43a7 Previous: 63d3434 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4166 ns 4083.5 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3979.5 ns 4042 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5083 ns 4917 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3812.5 ns 3833 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 62377 ns 59941 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10458 ns 11250 ns 0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10542 ns 10500 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10917 ns 11541 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10250 ns 10958 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 423465.5 ns 421187 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1125 ns 1167 ns 0.96
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1334 ns 1250 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1417 ns 1417 ns 1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1167 ns 1167 ns 1
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 17673 ns 17939 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4167 ns 4125 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4042 ns 3958 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4458 ns 4292 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3916 ns 4062.5 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 108782.5 ns 108432 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58666 ns 57333 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46208 ns 46250 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46500 ns 47041 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82000 ns 82125 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37058 ns 36736 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2057542 ns 1991000.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2088812.5 ns 2094313 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2085708 ns 2094167 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2000917 ns 1997041.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196173 ns 194384.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 188166.5 ns 143854.5 ns 1.31
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 146417 ns 143125 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 168667 ns 147041 ns 1.15
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144583 ns 144750 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166189 ns 165602 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1152084 ns 1114896 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1114937 ns 1128937.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1113917 ns 1128792 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1124792 ns 1114542 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 526984 ns 526049 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3958 ns 3458 ns 1.14
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3791 ns 3416 ns 1.11
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4292 ns 4145.5 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3209 ns 3584 ns 0.90
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 66944 ns 70040 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9292 ns 8917 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10458 ns 9042 ns 1.16
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9625 ns 9459 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9000 ns 8917 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 476984 ns 447136 ns 1.07
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15666.5 ns 15041 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15250 ns 17541.5 ns 0.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17375 ns 17625 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15500 ns 15917 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 54731 ns 54471 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 226333 ns 217417 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215770.5 ns 213417 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214541.5 ns 214979.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214771 ns 225771 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 273572 ns 270355 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 750 ns 791 ns 0.95
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 750 ns 625 ns 1.20
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 708 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 708 ns 667 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17271.5 ns 17190 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1500 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1458 ns 1500 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1750 ns 1666 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1417 ns 1500 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 101757 ns 101385 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7208 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5875 ns 5916 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5959 ns 5917 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 9875 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23740 ns 23163 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 233042 ns 223083 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229417 ns 228500 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 230042 ns 230208 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214167 ns 217000 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 168563 ns 166961 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3958 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23692 ns 23600 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16875 ns 16792 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 17167 ns 16750 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16958 ns 17041 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16750 ns 17000 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 161637.5 ns 161078 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 604292 ns 577750 ns 1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 572625 ns 572709 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 576208 ns 574833 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 574666 ns 575625 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113368 ns 112893 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1449125 ns 1420292 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1425458 ns 1425209 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1417250 ns 1426583 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1422625 ns 1429020.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 212042 ns 211317.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1083896 ns 1077500 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 955958 ns 960792 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1346958 ns 1350854.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1303396 ns 1298750 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 270063.5 ns 273506 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5906562.5 ns 6004937.5 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4521396 ns 4547292 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4939270.5 ns 4929708.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5518541 ns 5555333 ns 0.99
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1072796 ns 1074648 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 583 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23647 ns 23430 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2250 ns 2167 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2084 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2250 ns 2167 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2166 ns 2084 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 170300.5 ns 173597 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4583 ns 4292 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 3833 ns 3750 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5000 ns 4917 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3854.5 ns 3958 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65114 ns 65160 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12000 ns 11209 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11416 ns 11250 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12166 ns 12208 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11042 ns 11125 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 449470 ns 447745.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7270.5 ns 6166 ns 1.18
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6084 ns 6375 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7709 ns 8125 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6041 ns 6583 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 52133.5 ns 52163 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17709 ns 16750 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17750 ns 18209 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 19167 ns 18500 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17459 ns 17000 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 299326.5 ns 298259.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 667 ns 583 ns 1.14
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 583 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32714 ns 32532 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9083 ns 8208 ns 1.11
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8709 ns 8667 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9584 ns 9333 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8292 ns 8083 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 160130 ns 158900.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 65125 ns 64500 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64333 ns 64500 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64416 ns 64458 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64500 ns 64375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112090.5 ns 111633.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 284417 ns 274542 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 283229.5 ns 287042 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 276375 ns 274708 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 274250 ns 280292 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 186688.5 ns 186083 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3285312 ns 3329333 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3024209 ns 3017229 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3022833 ns 3024687.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4057542 ns 3956250 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 576630.5 ns 577429 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7680958 ns 7623958 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7452250 ns 7210334 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7441167 ns 7453270.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8201292 ns 8209375 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1341272 ns 1359043.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17502166.5 ns 17513124.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17538625 ns 17530146 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17554667 ns 17518395.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14123291 ns 14128813 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23611750.5 ns 23645979.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34023937.5 ns 33821104.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37089000 ns 37080041 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34892958.5 ns 34888834 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1859854 ns 1866294 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 188302208 ns 189046208 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 163879813 ns 164619624.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152774791.5 ns 152711479 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 433871375 ns 436948083 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13921725.5 ns 13894254.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 289641667 ns 289373791 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 250562000 ns 251042625 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 297502250 ns 296809167 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 473866146 ns 474994229.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24208 ns 22250 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 21875 ns 24542 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24834 ns 23188 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21625 ns 22417 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 97906 ns 96027 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 104333.5 ns 116584 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 105000 ns 113125 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 105041.5 ns 117833 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 102667 ns 103854 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 513627.5 ns 510213 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6708 ns 5833 ns 1.15
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5708 ns 5917 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6667 ns 6812.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5937.5 ns 6292 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68769 ns 68158.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14959 ns 14875 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15459 ns 14812.5 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15667 ns 14875 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14750 ns 15042 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 486570.5 ns 478636.5 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3028666.5 ns 3009146 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2062834 ns 2061334 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2293375 ns 2279208 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4537375 ns 4871541.5 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 588519 ns 589315.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23569083 ns 23547375 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17970166 ns 17982875.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16984875 ns 16893209 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 34970875 ns 34849958 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2769808.5 ns 2772744 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33513292 ns 33314834 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27545208.5 ns 27464208 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27453833 ns 27410208 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 40974459 ns 41078500 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74167 ns 72375 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 71979.5 ns 74375 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75167 ns 75166 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72292 ns 75167 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 104951 ns 102682 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 308750 ns 286145.5 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 321771 ns 210021.5 ns 1.53
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 321167 ns 315000 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219499.5 ns 218458 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 547081 ns 553543 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12083 ns 11875 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12041 ns 11708 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12292 ns 13334 ns 0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11645.5 ns 13125 ns 0.89
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 69950.5 ns 71259 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26708 ns 26833.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27417 ns 26375 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27333 ns 27417 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26750 ns 25854.5 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 473806 ns 477064.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12958 ns 12041.5 ns 1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12125 ns 12229.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13542 ns 13958 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12083 ns 12584 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 52821 ns 53895.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26208 ns 25875 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26417 ns 25834 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26167 ns 26125 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26083 ns 25667 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 301072.5 ns 305285 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179709 ns 179417 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182000 ns 179417 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 182875 ns 181041 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179708 ns 180042 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 57140 ns 58113 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 584541.5 ns 590084 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 585375 ns 585083 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 594791 ns 591062.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 583104 ns 584333 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 289093 ns 289662.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6541 ns 6083 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6042 ns 5500 ns 1.10
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6667 ns 7542 ns 0.88
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5625 ns 6604.5 ns 0.85
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70191 ns 70599 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14334 ns 14291 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14958 ns 14209 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14770.5 ns 14917 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14166 ns 13062.5 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 463731.5 ns 466681.5 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1233354 ns 1223541.5 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1241354 ns 1236625 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1307584 ns 1285666.5 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1016334 ns 1007959 ns 1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302150 ns 301986 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4139000 ns 4226959 ns 0.98
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4384875 ns 4384249.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4566250 ns 4572312.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3703750 ns 3695104.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1038963 ns 1047036 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1834 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23482 ns 24200 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5000 ns 4875 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4917 ns 4833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4959 ns 4875 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 188186.5 ns 192268.5 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6084 ns 5458 ns 1.11
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6000 ns 5542 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7000 ns 6791.5 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5458 ns 5792 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 54827 ns 56595.5 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11541 ns 10500 ns 1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11042 ns 10416 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11541 ns 11375 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10708 ns 10875 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 328929.5 ns 335979.5 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 334 ns 0.87
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 375 ns 333 ns 1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 334 ns 334 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23065 ns 23172 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2834 ns 2833 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2709 ns 1.06
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3041 ns 3042 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2791 ns 2791 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 159599.5 ns 162255.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11875 ns 11084 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11270.5 ns 11000 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12895.5 ns 13563 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11292 ns 11458 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 56968 ns 58685.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25333 ns 24542 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25062.5 ns 24542 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25458 ns 25167 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24750 ns 25000 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 287900 ns 298266 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4250 ns 4208 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4250 ns 4250 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24679 ns 25307 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16625 ns 16166 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16125 ns 16292 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16625 ns 16334 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16208 ns 16084 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 195350.5 ns 199542 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 5709 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5791 ns 5917 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5833 ns 5792 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5791 ns 5834 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33389 ns 33833 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 21000 ns 20292 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20834 ns 20375 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21083 ns 20875 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21000 ns 20250 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 175063 ns 178083 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 424229.5 ns 420500 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 382042 ns 372625 ns 1.03
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 477708 ns 482833 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 104666.5 ns 103292 ns 1.01
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66983 ns 67723.5 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 882187.5 ns 922417 ns 0.96
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 977667 ns 955208.5 ns 1.02
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1179166.5 ns 1180875 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 383458 ns 379083 ns 1.01
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 188560 ns 192988 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80167 ns 136917 ns 0.59
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80458 ns 79854.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 83104 ns 82750 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80812.5 ns 81167 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193407.5 ns 194081 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1950292 ns 1915042 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1903208 ns 1919750 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1922208 ns 1926125 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1919104 ns 1915750 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 398056.5 ns 401908.5 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21900 ns 22364 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 169251 ns 174295 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7000 ns 6042 ns 1.16
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6000 ns 6500 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7645.5 ns 7812.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6541 ns 6541 ns 1
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 58682 ns 61489.5 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9584 ns 9000 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9250 ns 8792 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9542 ns 9375 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9000 ns 9459 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 299860 ns 308375 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120365479 ns 118419979.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174058791 ns 173770000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148250542 ns 148397083 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106407666 ns 104919541 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5471253 ns 5493586 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 611069708.5 ns 611739750.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 551988959 ns 553521958 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 449773604.5 ns 449841709 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 629581291.5 ns 631089333.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38220940.5 ns 38209825 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 662619541 ns 652096250 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 664048916.5 ns 661126562.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 590023604 ns 580970687.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 859382500 ns 848782167 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 60333 ns 58667 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47375 ns 47500 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47459 ns 48250 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84416 ns 83625 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37873 ns 37628 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1951062.5 ns 1919312.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1961417 ns 1980333.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1976854.5 ns 1982541.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1887041.5 ns 1895625 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 175346 ns 176341 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 266583 ns 266208 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 265916 ns 265334 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 288687.5 ns 288604 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267875 ns 268167 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 130934 ns 130454.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 694458 ns 664646 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 699271 ns 671062.5 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 693292 ns 665875 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 595500 ns 597542 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 689357.5 ns 690208 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2189917 ns 2192312.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2219750 ns 2179542 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2218583 ns 2181333.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2179271 ns 2207146 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 134898 ns 134808 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5543688 ns 5469791 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5493750 ns 5472958.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5499792 ns 5499916 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5500125 ns 5442583.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 773264.5 ns 720984 ns 1.07
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 648250 ns 644667 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 636583 ns 644084 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 639917 ns 642042 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 643375 ns 644167 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47646 ns 47636.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1859250 ns 1819917 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1718062 ns 1720500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1731958 ns 1721792 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2102291 ns 2100000 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 227022 ns 224071 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59208 ns 57667 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46917 ns 46666 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47187.5 ns 46583 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84750 ns 83750 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 29453 ns 28795 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2049209 ns 2029583 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2082750 ns 2087375 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2084729.5 ns 2087791.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2003209 ns 1991416.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 193498.5 ns 190320 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13383917 ns 13371041.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12429000 ns 12439187.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12553500 ns 12491875 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15134250 ns 15195833.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 511588 ns 516777 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47363459 ns 47119104.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41713041 ns 41727062.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40957334 ns 41051417 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58315270.5 ns 58599458 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2895504 ns 2892052.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 97529916.5 ns 74212666 ns 1.31
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 90590375 ns 67877750 ns 1.33
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90429374.5 ns 90536499.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76443542 ns 98549792 ns 0.78
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59583 ns 58375 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46334 ns 46459 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47375 ns 47708 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82666 ns 83958 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47291 ns 47165 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1947979.5 ns 1919583.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1961250 ns 1980791 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1973271 ns 1979229.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1897250 ns 1886958 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 192219 ns 193816.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 417 ns 333 ns 1.25
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32079 ns 32624 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 5833 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6333 ns 6083 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6750 ns 6416.5 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 5937.5 ns 5833 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 170833.5 ns 171378.5 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32116 ns 32204 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2875 ns 2583 ns 1.11
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2833 ns 2625 ns 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2875 ns 2875 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2667 ns 2625 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 160054.5 ns 159764 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 283749812 ns 286393770.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 341288375 ns 340253500 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314539291.5 ns 313806270.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 271510208 ns 268566520.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7104228 ns 7103110 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1010627500 ns 1012043792 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 956683750 ns 955581708 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 852193146 ns 855297583 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1261954792 ns 1259239875 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33827664 ns 33847341 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1667661584 ns 1418325958.5 ns 1.18
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1666216916 ns 1338395020.5 ns 1.24
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1606436083 ns 1636087292 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1366973687.5 ns 1775858125 ns 0.77
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1403291 ns 1409833 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1427917 ns 1414458.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1420291.5 ns 1465562.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1407083 ns 1413458.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127686 ns 127951 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5077374.5 ns 5027250 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5020479 ns 5036354 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5013208.5 ns 5030437.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5011896 ns 5027250.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 590954 ns 479205.5 ns 1.23
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 169989042 ns 170869291 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 124556708 ns 128735708 ns 0.97
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 117703542 ns 105431542 ns 1.12
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 167773041 ns 167706958 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4847276 ns 4877746.5 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 628606625 ns 511068334 ns 1.23
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 495985459 ns 490911792 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 384929709 ns 385742875 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 647293458 ns 650161000 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16634773 ns 16340937 ns 1.02
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8966583.5 ns 9003042 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8908167 ns 8983042 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7918042 ns 7909375 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9735042 ns 9604229.5 ns 1.01
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1592278 ns 1611438.5 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36520208 ns 36334167 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37244833 ns 37265291.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33588209 ns 33553354 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 37756875 ns 37555333 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6456339 ns 6454550 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47520.5 ns 47333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47625 ns 47500 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47875 ns 47625 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47666 ns 47417 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18617 ns 18252 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50459 ns 50417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50500 ns 50666 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50500 ns 50625 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50458 ns 50250 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 207525 ns 164880 ns 1.26
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7084 ns 6417 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6291 ns 6792 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7667 ns 7583.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7083 ns 6792 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 102649 ns 76692.5 ns 1.34
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10375 ns 10125 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9917 ns 9750 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10459 ns 10250 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10042 ns 9875 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 608896.5 ns 448214.5 ns 1.36
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6229.5 ns 5666 ns 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5750 ns 5791 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7292 ns 7583 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6250 ns 6042 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 127225 ns 81735 ns 1.56
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13625 ns 13208 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13208 ns 12709 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 14291 ns 13375 ns 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13333 ns 13417 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 504128.5 ns 399198.5 ns 1.26
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1084 ns 959 ns 1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1042 ns 1000 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 33015 ns 32447 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8292 ns 7666 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7875 ns 7708 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8500 ns 7958 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7958 ns 8166 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 206728 ns 187787.5 ns 1.10
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23583 ns 23167 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23500 ns 23209 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23792 ns 23250 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23584 ns 23292 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18598 ns 18320.5 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52667 ns 52917 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52625 ns 52167 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52875 ns 52917 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52792 ns 52875 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 277051.5 ns 214503.5 ns 1.29
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1457958 ns 1398125 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1399791 ns 1402146 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1407542 ns 1406437.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1404458.5 ns 1448937.5 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196597 ns 196187.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5037333 ns 5003458 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4999291.5 ns 5029708 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5017000 ns 5015042 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5011541.5 ns 5005729.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 588570.5 ns 509817 ns 1.15
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3068291.5 ns 3051834 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2088000 ns 2076520.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2298792 ns 2302500 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4816145.5 ns 4658291.5 ns 1.03
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 579682.5 ns 581685 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24422833.5 ns 24315708 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18831541.5 ns 18877250 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17767708 ns 17822166 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35768979 ns 35790999.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2840747 ns 2842698 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34254979.5 ns 33982916.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28263375 ns 28228208.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28014875.5 ns 27940958 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41850458 ns 41757334 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144995833 ns 143078500 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 146762458.5 ns 146668125 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126155437.5 ns 127355624.5 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 172892333 ns 171841729.5 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22763599 ns 22550146 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1302781041.5 ns 1234730083.5 ns 1.06
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1105540874.5 ns 1060723417 ns 1.04
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 818915959 ns 1027004875 ns 0.80
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 672266208 ns 674561583 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118087389 ns 117659213 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72667 ns 74125 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 72541 ns 73146 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 81041 ns 76000 ns 1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 85000 ns 85834 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 215492.5 ns 175925 ns 1.22
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 297062.5 ns 215750 ns 1.38
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 285541.5 ns 192541.5 ns 1.48
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 287041.5 ns 284542 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 287958 ns 285708 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1143307 ns 952026.5 ns 1.20
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35739792 ns 35486000 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36309625 ns 36428646.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32471312.5 ns 32475229 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40331146 ns 40408041.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5837713 ns 5831517 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148519833 ns 146000771 ns 1.02
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 151197291.5 ns 154808750 ns 0.98
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 135647708.5 ns 137043083.5 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 285564667 ns 285556542 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34868506.5 ns 34852076.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121927770.5 ns 121592083 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173726125 ns 174639125 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147938542 ns 148027541 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106465209 ns 105917833 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5469946 ns 5344344 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 469829834 ns 468650958 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 466404000 ns 466713000 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 440435500 ns 437158458 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 742141416 ns 744371959 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35183973.5 ns 35992005 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 708884520.5 ns 712765167 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 638944062.5 ns 641204167 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 620145875.5 ns 624084979.5 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 853299834 ns 856208084 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1349459 ns 1270583 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 987708 ns 995709 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 972125 ns 995875 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2058666.5 ns 2037625 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 571992 ns 569478 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 3012354.5 ns 2961229.5 ns 1.02
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2612541.5 ns 2647792 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2622125 ns 2621500 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3696542 ns 3709750 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1700920.5 ns 1587708.5 ns 1.07
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5829708 ns 5785812.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5779417 ns 5824083 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5792500 ns 5785375 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2892874.5 ns 2904896 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7416 ns 7250 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6167 ns 6125 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6041 ns 6042 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10042 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25317 ns 24479.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225041 ns 223812.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 231875 ns 222667 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220625 ns 220792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214333 ns 240666 ns 0.89
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 240148.5 ns 212315.5 ns 1.13
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 297383209 ns 296229125 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 218983500 ns 216728584 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 195262458 ns 190254604.5 ns 1.03
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 306663354 ns 304954521 ns 1.01
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7666475 ns 7671461.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1233887896 ns 1229817167 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 895083125 ns 902846291.5 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 815769750 ns 824304209 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1143449395.5 ns 1157856750.5 ns 0.99
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26736681 ns 26996841 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5791 ns 5292 ns 1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4917 ns 5291.5 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6583 ns 6375 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5542 ns 5250 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 147831.5 ns 112898 ns 1.31
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7791 ns 6875 ns 1.13
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7166 ns 6958 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7792 ns 7583 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7292 ns 7125 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 563806 ns 535221.5 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 666 ns 584 ns 1.14
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 541 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23941 ns 23660 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9459 ns 8625 ns 1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9209 ns 9084 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10208 ns 9417 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9000 ns 8708 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 201384.5 ns 195936.5 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 356459 ns 352958.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 353125 ns 352792 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352041 ns 351479 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 354041 ns 356708.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21182 ns 20962 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 834521 ns 775625 ns 1.08
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 777562.5 ns 825833 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 812209 ns 812229.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 830208 ns 834959 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 253346 ns 234827 ns 1.08
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 341437.5 ns 341562.5 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 346041 ns 341958 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 451667 ns 455917 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 11104.5 ns 11083 ns 1.00
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17955 ns 17699 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 718625 ns 712500 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 732125 ns 739896 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1003979 ns 1007854 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 27625 ns 26459 ns 1.04
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 227406.5 ns 214680.5 ns 1.06
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 378875 ns 381042 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 350708 ns 346750 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 442208 ns 449187.5 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 31583 ns 39042 ns 0.81
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22686 ns 22537 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 737208 ns 733792 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 780083 ns 788958 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1026792 ns 1032500 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 104333 ns 105583 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 206514.5 ns 200835.5 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3500 ns 3791 ns 0.92
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3667 ns 3541 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3792 ns 3708 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3541.5 ns 3708 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17783 ns 17542 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4250 ns 4250 ns 1
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4209 ns 4167 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4375 ns 4250 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4167 ns 4250 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 234795 ns 204574.5 ns 1.15
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3875 ns 3834 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3541 ns 3667 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4375 ns 4250 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3625 ns 3625 ns 1
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 190263 ns 160115.5 ns 1.19
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8667 ns 8292 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8667 ns 8166 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8604.5 ns 8458 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8208.5 ns 8333 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1137360 ns 989699 ns 1.15
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 205416 ns 203375 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 211792 ns 212791 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 215417 ns 210666 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 201417 ns 200834 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34517 ns 34428 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 645937.5 ns 652624.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 629292 ns 622667 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 631916.5 ns 631604.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 629958 ns 632750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 324473.5 ns 280400.5 ns 1.16
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 1023208 ns 994229.5 ns 1.03
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1019708 ns 1040292 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 953687.5 ns 956020.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 859854 ns 853917 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/CUDA 206050 ns 208023.5 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4569708 ns 4502437.5 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4685750 ns 4668229.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4464417 ns 4455084 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 4253187.5 ns 4280937 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 925934 ns 935555 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4000 ns 3292 ns 1.22
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3125 ns 3458 ns 0.90
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 3958 ns 4042 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3167 ns 3209 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 204899.5 ns 159049 ns 1.29
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7750 ns 7291 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7458 ns 7333 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7334 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6875 ns 6833 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 966567 ns 850635.5 ns 1.14
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1646958 ns 1640041 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1166750 ns 1196604.5 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1367791.5 ns 1383250 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2480916 ns 2417500 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212410 ns 215018 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12414979.5 ns 12333396 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9580459 ns 9592791.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9268479.5 ns 9267625 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18065125 ns 18011459 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1952065 ns 1959459 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17401166 ns 17332937.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14384792 ns 14386792 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14373584 ns 14369396.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21058416 ns 21112291.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 136959 ns 87708 ns 1.56
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 88187.5 ns 88542 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 93625 ns 92833 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 132479 ns 116000 ns 1.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126700 ns 126352.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2048041.5 ns 2022959 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1852917 ns 2049666 ns 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2026645.5 ns 2035562.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026917 ns 2025938 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 986444.5 ns 878938 ns 1.12
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 4083.5 ns 2750 ns 1.48
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 2917 ns 3209 ns 0.91
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3000 ns 3417 ns 0.88
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 2208 ns 2792 ns 0.79
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16657 ns 16283 ns 1.02
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2875 ns 2542 ns 1.13
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2708 ns 2708 ns 1
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3000 ns 2875 ns 1.04
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2792 ns 2834 ns 0.99
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 184699.5 ns 176848 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7500 ns 7083 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5959 ns 6000 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 6041 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10042 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34237 ns 34134 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 226625 ns 221583 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229917 ns 220000 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220083 ns 220417 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 216458 ns 215333 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 327734.5 ns 285763.5 ns 1.15
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3750 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22866 ns 22875 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14709 ns 14500 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14459 ns 14375 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14666 ns 14458 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14458 ns 14500 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 457315.5 ns 410580 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 138584 ns 92125 ns 1.50
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 91416.5 ns 92916 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 97042 ns 96979 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 140937 ns 138000 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126314 ns 125660 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1957979.5 ns 1923792 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1909167 ns 1935291 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1924104 ns 1932916.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1923208 ns 1920500 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 889594 ns 861874.5 ns 1.03
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 882875 ns 873916 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 825333.5 ns 826583 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1222459 ns 1222000 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 968812 ns 963750 ns 1.01
lenet(28, 28, 1, 32)/forward/GPU/CUDA 275152 ns 276546 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2832250 ns 2791083 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2436875 ns 2445687.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3346084 ns 3347916 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3408479 ns 3371375 ns 1.01
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1574637 ns 1487194.5 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17917 ns 17250 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 14708 ns 17959 ns 0.82
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16292 ns 17875 ns 0.91
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15416 ns 17417 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 132847.5 ns 130892 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 261666 ns 218625 ns 1.20
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215875 ns 260667 ns 0.83
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 227834 ns 227792 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255708 ns 256083 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 611860.5 ns 584591.5 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222416 ns 222000 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 221458.5 ns 222667 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 220979 ns 222312.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 221583 ns 220833 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 248634.5 ns 243596.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 577583 ns 501417 ns 1.15
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 510958.5 ns 496084 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 528041.5 ns 508541.5 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 518834 ns 561833 ns 0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1309411.5 ns 1202534 ns 1.09
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 4125 ns 3895.5 ns 1.06
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 3459 ns 4270.5 ns 0.81
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 5041 ns 5708 ns 0.88
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 4084 ns 4458.5 ns 0.92
batchedmm(16, Bsize=4)/forward/GPU/CUDA 17359 ns 16584 ns 1.05
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7500 ns 7208.5 ns 1.04
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7333 ns 7000 ns 1.05
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7333 ns 7625 ns 0.96
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7417 ns 7500 ns 0.99
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 183645 ns 179332 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20500 ns 17687 ns 1.16
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16479.5 ns 17917 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19063 ns 18625 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19833 ns 18729 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 145474 ns 135434 ns 1.07
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221208 ns 211041 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212375 ns 220417 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214166.5 ns 212542 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 224041.5 ns 212271 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 906684.5 ns 847267 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4312.5 ns 3959 ns 1.09
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 3916 ns 4209 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4708 ns 4875 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3833 ns 4291 ns 0.89
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 198063 ns 187480.5 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10750 ns 10459 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11020.5 ns 10541.5 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11042 ns 10042 ns 1.10
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10167 ns 10125 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 997044 ns 955985 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3708 ns 3145.5 ns 1.18
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3250 ns 2937.5 ns 1.11
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4375 ns 4000 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3166 ns 3167 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 217373.5 ns 188520.5 ns 1.15
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7834 ns 7375 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7834 ns 7209 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7625 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7333 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1011885.5 ns 987324 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23840875 ns 23406938 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35027854 ns 35765125 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37123250 ns 37705500 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34879542 ns 34946604 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1843048 ns 1830206.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184325375 ns 183995333 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 162219583 ns 165575375 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 145745916.5 ns 146468292 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 274832375 ns 274483625 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16494818.5 ns 16521685 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 274529917 ns 276817937 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 253641458.5 ns 246377395.5 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 231405354 ns 231576042 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 324348208.5 ns 325032833.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 184625 ns 182896.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182292 ns 184292 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185312.5 ns 184958 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 184083 ns 183167 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 202083.5 ns 200810.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 645750 ns 635333 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 603542 ns 633354.5 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 599333 ns 600291 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 631333 ns 597271 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 985206 ns 958799 ns 1.03
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3919209 ns 3842750 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3919667 ns 3997500 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3545875 ns 3542792 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4578250 ns 4556625 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 536849 ns 532425 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17586375 ns 17396104 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17890250 ns 18078958 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16507834 ns 16589917 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 19939000 ns 19981167 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2632687 ns 2633170 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 584 ns 542 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32135 ns 32094 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9625 ns 8917 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9458 ns 8750 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9583 ns 9041 ns 1.06
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8917 ns 9042 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 248690.5 ns 249030 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 653687729.5 ns 652464437.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 392525041.5 ns 394034604 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 328748959 ns 326393417 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 745842709 ns 748745833 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12472241 ns 12466975 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1889224479.5 ns 1885107791.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1635009041 ns 1638827875 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1509526291.5 ns 1512914354 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2196907667 ns 2208603583.5 ns 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49240921 ns 49231175.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1653562.5 ns 1616792 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1162084 ns 1200917 ns 0.97
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1386645.5 ns 1389625 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2484916 ns 2477916.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215030.5 ns 215338 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12723500 ns 12691834 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9937604 ns 9979354.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9655333 ns 9689896 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18319042 ns 18371271 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2044976 ns 1985308 ns 1.03
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17717375 ns 17676916 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14639896 ns 14722000 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14564292 ns 14613667 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21414771 ns 21413395.5 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26500 ns 26292 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26292 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26291 ns 26291 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26250 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24435 ns 23721 ns 1.03
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67542 ns 67333 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67208 ns 67333 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67792 ns 67209 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67083 ns 67333 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 375943.5 ns 367128.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203958 ns 203542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210917 ns 208625 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211667 ns 209584 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199917 ns 199792 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26886 ns 25494 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 665791.5 ns 604625 ns 1.10
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 634228.5 ns 670666.5 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 663125 ns 632166.5 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 631979.5 ns 630000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 330848.5 ns 321975.5 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 660812.5 ns 639021 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 641541.5 ns 643458 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 644229 ns 658750 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 672667 ns 632750 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132735.5 ns 131332 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2283708 ns 2244229 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2228458 ns 2277708.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2242479.5 ns 2240167 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2245854 ns 2235458.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1161748 ns 1075922 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18458 ns 17167 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19041.5 ns 17916 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19583 ns 18167 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20042 ns 18208 ns 1.10
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 133846.5 ns 130720.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 263333 ns 258584 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 232125 ns 227459 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 266084 ns 232750 ns 1.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 260771 ns 230791 ns 1.13
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 976903 ns 887768.5 ns 1.10
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 666 ns 625 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 708 ns 666 ns 1.06
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 24021 ns 23104 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9979.5 ns 9750 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9875 ns 9250 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10083 ns 9208 ns 1.10
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9625 ns 9417 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 250362 ns 242418 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5792 ns 5208 ns 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5250 ns 5125 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6667 ns 6375 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5583 ns 5375 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 210534.5 ns 193804 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8041 ns 7167 ns 1.12
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7458 ns 7250 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7583 ns 7375 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7500 ns 7042 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 742239 ns 706410 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2167 ns 2125 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2125 ns 2250 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2375 ns 2209 ns 1.08
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2375 ns 2208 ns 1.08
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18253.5 ns 17672 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6875 ns 6458 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6375 ns 6291 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6708 ns 6709 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6500 ns 6500 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 309540.5 ns 300575 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 752500 ns 749459 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746584 ns 748959 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 751083.5 ns 750854 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 749166.5 ns 749167 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21684 ns 20805 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 791542 ns 775208 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 775083 ns 795916.5 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 795500 ns 792791 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 796417 ns 792792 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 276627 ns 274546.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7208 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 5917 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5959 ns 5959 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10208 ns 10250 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33842.5 ns 33244 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 267708 ns 219625 ns 1.22
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 241125 ns 240291 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 233083 ns 237583 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 254500 ns 260042 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 334966 ns 337443 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10709 ns 10084 ns 1.06
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10125 ns 9583 ns 1.06
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11208 ns 10750 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9875 ns 10167 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 221448.5 ns 223296.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24625 ns 25125 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24854.5 ns 24312.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25125 ns 24917 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24833 ns 24667 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1054466 ns 1047460.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106442250 ns 106018062.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 118292812 ns 118144520.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120274792 ns 120409292 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 118694500 ns 117468833 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2661687 ns 2652084 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 393734750 ns 373672500 ns 1.05
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 365034583 ns 359102771.5 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 354260000.5 ns 356068521.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 543249000 ns 543525042 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15218732.5 ns 15230726 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 784688333 ns 605345333 ns 1.30
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 760961125 ns 584604208 ns 1.30
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 748037208.5 ns 744606604.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 614191479.5 ns 793208583.5 ns 0.77
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8292 ns 6500 ns 1.28
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6459 ns 6375 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8625 ns 8062 ns 1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6875 ns 7146 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 215167.5 ns 216878 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14208 ns 13625 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14334 ns 13625 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14708 ns 14125 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14167 ns 14084 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1007614.5 ns 1010131 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6166 ns 5625 ns 1.10
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5375 ns 6000 ns 0.90
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7542 ns 7895.5 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5959 ns 5958 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 209886 ns 211472.5 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13250 ns 12583 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12334 ns 12333 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13375 ns 12708 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12625 ns 12709 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 719093 ns 725788 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5541 ns 5583 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 5208 ns 5875 ns 0.89
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 6500 ns 6583.5 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5625 ns 6167 ns 0.91
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16998 ns 17002 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15792 ns 15916 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15458 ns 15250 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15667 ns 16125 ns 0.97
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15375 ns 15834 ns 0.97
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 185360 ns 187784.5 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 417 ns 292 ns 1.43
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 333 ns 334 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23196 ns 23531 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6167 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6375 ns 6292 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6791 ns 6459 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6062.5 ns 6084 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 225872.5 ns 228744 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5958 ns 5834 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5916 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5959 ns 5959 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 5959 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24541.5 ns 24273 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21416 ns 20833 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21500 ns 20750 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21750 ns 21292 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21708 ns 21041 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 248924 ns 251207.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144166.5 ns 185375 ns 0.78
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 143812.5 ns 144625 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 147187.5 ns 147917 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 191416 ns 144417 ns 1.33
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167441 ns 166909.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1359562.5 ns 1321833 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1324500 ns 1350479 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1321625 ns 1337166 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1323416 ns 1323625 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1302768.5 ns 1251196 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24208 ns 24833 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 21916 ns 25041 ns 0.88
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25000 ns 23958 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24834 ns 24271 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 324085 ns 315591 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 174729 ns 131292 ns 1.33
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 119604 ns 118396 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 179041.5 ns 176916 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 178833 ns 129458 ns 1.38
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1333498 ns 1353120 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 334 ns 292 ns 1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23177 ns 23127 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6584 ns 6125 ns 1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6541 ns 6459 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6916 ns 6333 ns 1.09
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6459 ns 6125 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 242684 ns 245064.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4771 ns 4208 ns 1.13
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4583 ns 4875 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5250 ns 5125 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4583 ns 4667 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 227724.5 ns 228957.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10417 ns 9875 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10416 ns 9875 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10417 ns 10334 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10125 ns 10208 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1265317 ns 1285818.5 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1584 ns 1584 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1667 ns 1625 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23073 ns 23344 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5750 ns 5750 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5875 ns 5709 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6000 ns 6000 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5667 ns 5666 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 260330.5 ns 264086.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6829416.5 ns 6807541.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6373750.5 ns 6433375 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6530583 ns 6489875 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7628125 ns 7649521 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214315 ns 214938 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24071417 ns 24073959 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21308167 ns 21296000 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 20999125 ns 21044062.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29763520.5 ns 29805771 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2117367 ns 2104181 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48706709 ns 37247625 ns 1.31
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45307041 ns 34089791 ns 1.33
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45597625 ns 45725979.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 38094041.5 ns 49397750 ns 0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6542 ns 5500 ns 1.19
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5666 ns 5708 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6834 ns 6541 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5979.5 ns 5708 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 209381.5 ns 208256 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 8084 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8375 ns 8125 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8916 ns 8375 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8375 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 976901 ns 991485 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1572791 ns 1509000 ns 1.04
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1254791 ns 1282542 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1624459 ns 1634916.5 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2178625 ns 2162000.5 ns 1.01
lenet(28, 28, 1, 128)/forward/GPU/CUDA 273695 ns 271116.5 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7953208 ns 7902209 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6560458 ns 6449312.5 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7149104 ns 7195708 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10472292 ns 10462229 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1791343 ns 1752716.5 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 369500 ns 371187.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 374937.5 ns 374208 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 455625 ns 461250 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 23937.5 ns 22208 ns 1.08
batchedmm(128, Bsize=4)/forward/GPU/CUDA 47162 ns 42428.5 ns 1.11
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 737042 ns 745437.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 809667 ns 815833 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1061750 ns 1062958 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 118833 ns 117396 ns 1.01
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 286422.5 ns 283256.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397500 ns 397208 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288042 ns 288667 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287958 ns 287875 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 750083 ns 750917 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44452 ns 43636 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 676958 ns 667000 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 532500 ns 531375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 530042 ns 531417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 974250 ns 974083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 193188.5 ns 188745 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 683375 ns 644833 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 641167 ns 648750 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 649375 ns 644479 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 673812 ns 652458.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132770 ns 131347.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2487917 ns 2445334 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2450875 ns 2500021 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2451584 ns 2463250 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2463958 ns 2463375 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1259284 ns 1238313 ns 1.02
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 4209 ns 3417 ns 1.23
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 3750 ns 3625 ns 1.03
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4270.5 ns 4250 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 4667 ns 3437.5 ns 1.36
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16229 ns 16066 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5583 ns 5375 ns 1.04
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5375 ns 5292 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5625 ns 5750 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5583 ns 5583 ns 1
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 186384.5 ns 182995 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1459666 ns 1458042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1502500 ns 1499750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1501291 ns 1503250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1437500 ns 1437708 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40589 ns 40191 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5161542 ns 5113291 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5284458 ns 5287958 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5290208 ns 5307041.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4991020.5 ns 4985125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 198827.5 ns 196599 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34039 ns 33557 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15416 ns 15125 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15417 ns 15167 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15500 ns 15416 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15209 ns 15208 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 356561 ns 349206 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71667 ns 71125 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71208 ns 71542 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71167 ns 71209 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71250 ns 71041 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 114103.5 ns 113114 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 321500 ns 317667 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 318209 ns 324125 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 318791 ns 318292 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 318042 ns 317625 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 196364 ns 193277 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1125 ns 958 ns 1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1041 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 1125 ns 0.89
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 24058 ns 23048 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 7750 ns 1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8041 ns 8270.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8375 ns 8250 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8000 ns 8041 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 250219.5 ns 245757.5 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 511249.5 ns 502770.5 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 491875 ns 484500 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 563375 ns 561750 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 213208 ns 219917 ns 0.97
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129570 ns 129178 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1422313 ns 1387645.5 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1464542 ns 1473958 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1723104.5 ns 1779041.5 ns 0.97
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 871458 ns 862917 ns 1.01
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 278287 ns 273950 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 417 ns 333 ns 1.25
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 333 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32706 ns 31657.5 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6791 ns 6125 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6375 ns 6208 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6334 ns 6541 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6062.5 ns 6042 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 254343.5 ns 251419 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1757021 ns 1733792 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1724500 ns 1721208 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1725541.5 ns 1724250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1778395.5 ns 1773541 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169310.5 ns 168671 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4406167 ns 4114542 ns 1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4294542 ns 4392834 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4359458.5 ns 4368208.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4362750 ns 4369208.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1097195 ns 1291475.5 ns 0.85
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6917 ns 6834 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6667 ns 6667 ns 1
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6917 ns 7999.5 ns 0.86
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6895.5 ns 7041 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20814 ns 20138.5 ns 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 51833 ns 51250 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 51187.5 ns 32625 ns 1.57
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 48833 ns 73833 ns 0.66
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 51875 ns 51084 ns 1.02
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 199815.5 ns 340107 ns 0.59
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 18125 ns 17833 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 17625 ns 18083 ns 0.97
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 18833 ns 18875 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 18042 ns 18208 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18986 ns 18400 ns 1.03
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53167 ns 53250 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53375 ns 53041 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53417 ns 53375 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53583 ns 53542 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 321788.5 ns 319083.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75708 ns 75166 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75292 ns 75625 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 74958 ns 75291.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75333 ns 75083 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47411 ns 47469 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 330916.5 ns 324958 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 331041.5 ns 342000 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 324375 ns 325000 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 324541 ns 324542 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 214495.5 ns 211595 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1486458 ns 1484959 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1529333 ns 1526854.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1528292 ns 1527250 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1463917 ns 1462542 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52412 ns 51799 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5147625 ns 5111083.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5283458 ns 5312417 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5285042 ns 5299333.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4987292 ns 4982354 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 208778.5 ns 204934 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28209 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28333 ns 28187.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28208 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 25229 ns 24742 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66750 ns 66500 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66583 ns 66709 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66500 ns 66500 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66750 ns 66541 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 504373.5 ns 484630.5 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1502708.5 ns 1480583.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1140542 ns 1136563 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1124125 ns 1136750 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2213416 ns 2265937.5 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 588762 ns 579622.5 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3119667 ns 3074562.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2731437.5 ns 2788145.5 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2740375 ns 2743021 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3821208 ns 3819500.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2025801 ns 1931643 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7950666 ns 7902458 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7904604 ns 7834062.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7893208 ns 7920375 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4813854.5 ns 4826312.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 81708 ns 77625 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80167 ns 81167 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 83042 ns 84041.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 130291.5 ns 111396 ns 1.17
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193963.5 ns 193746 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2045417 ns 2012875 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2014334 ns 2046292 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2016958 ns 2031354 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2019542 ns 2015417 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 751114 ns 746361.5 ns 1.01

This comment was automatically generated by workflow using github-action-benchmark.

github-actions bot commented Jan 1, 2025

Benchmark Results (ASV)

main 24c9a7f... main/24c9a7fb15e617...
basics/overhead 0.123 ± 0.0012 μs 0.123 ± 0.0012 μs 1
time_to_load 0.943 ± 0.029 s 0.894 ± 0.02 s 1.05

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

@avik-pal avik-pal merged commit 367680b into main Jan 1, 2025
44 of 70 checks passed
@avik-pal avik-pal deleted the ap/random branch January 1, 2025 14:58
Successfully merging this pull request may close these issues.

Random Numbers & Reactant