chore: use [sources] in Project.toml #1090

Merged
avik-pal merged 1 commit into main from ap/sources
Nov 17, 2024
avik-pal
Member

No description provided.

Contributor

Benchmark Results (ASV)

| Benchmark | main | 855ff5b... | main/855ff5b7a25c2b... |
|:--|:--|:--|:--|
| basics/overhead | 0.121 ± 0.0011 μs | 0.12 ± 0.00074 μs | 1.01 |
| time_to_load | 1.16 ± 0.0047 s | 1.16 ± 0.0077 s | 1 |
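
The last column appears to be the `main` timing divided by the PR commit's timing, as its header `main/855ff5b7a25c2b...` suggests. A minimal Python sketch, assuming the ratio is taken over the central values and rounded to two decimals (the helper name `timing_ratio` is hypothetical and not part of the benchmark tooling):

```python
def timing_ratio(main_time: float, pr_time: float) -> float:
    """Return main_time / pr_time rounded to two decimals.

    Both timings must be in the same unit. Assumption: only the central
    values are used; the ± spread is ignored.
    """
    return round(main_time / pr_time, 2)


# Central values copied from the two rows of the table above.
overhead_ratio = timing_ratio(0.121, 0.12)  # basics/overhead, in μs
load_ratio = timing_ratio(1.16, 1.16)       # time_to_load, in s

print(overhead_ratio, load_ratio)
```

Under this reading, both rows sit within about 1% of 1, which is smaller than the reported spread on `basics/overhead`.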

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

@avik-pal avik-pal merged commit 2331c99 into main Nov 17, 2024
72 of 77 checks passed
@avik-pal avik-pal deleted the ap/sources branch November 17, 2024 19:06
Contributor

@github-actions github-actions bot left a comment

Lux Benchmarks

Benchmark suite Current: 855ff5b Previous: 3986545 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4084 ns 3792 ns 1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4125 ns 4084 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5083.5 ns 4834 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4000 ns 3959 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60614.5 ns 61509.5 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10458 ns 10500 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10666 ns 10541 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11125 ns 10250 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10583 ns 10250 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 424464.5 ns 431498.5 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1250 ns 1062.5 ns 1.18
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1250 ns 1167 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1375 ns 1417 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1167 ns 1208 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18114 ns 18573 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4083 ns 4000 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3792 ns 4000 ns 0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4375 ns 4209 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3875 ns 3750 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 109410 ns 111184 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57958 ns 57750 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46375 ns 38542 ns 1.20
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38292 ns 46583 ns 0.82
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81375 ns 82208 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37166 ns 37503.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2025083 ns 2037645.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2089000 ns 2095625 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2116042 ns 1844375 ns 1.15
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1997083 ns 2001375 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196065 ns 196039 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144167 ns 145583 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144459 ns 143584 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145167 ns 146458 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 145041 ns 145000 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165775.5 ns 168190 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1117334 ns 1114291 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1113209 ns 1150292 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1161895.5 ns 805500 ns 1.44
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1119208 ns 1122750 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 529787.5 ns 526921 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3500 ns 3292 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3645.5 ns 3666 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4125 ns 4167 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3291 ns 3500 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 68193 ns 72235.5 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8917 ns 10125 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8708 ns 8375 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9375 ns 8792 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9083 ns 8833 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 490128 ns 480020 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 14875 ns 14875 ns 1
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15417 ns 15000 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18042 ns 17520.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14958 ns 14583 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 54649 ns 53914 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213375 ns 214792 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213541 ns 214875 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214167 ns 214750 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215500 ns 226813 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 275612.5 ns 272785 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 625 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 625 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 917 ns 0.82
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 459 ns 1.36
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17498 ns 17774 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1667 ns 1792 ns 0.93
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1750 ns 1417 ns 1.24
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1916 ns 1709 ns 1.12
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1500 ns 1417 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 103477.5 ns 102929.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7167 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5792 ns 5250 ns 1.10
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5375 ns 6000 ns 0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 10000 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24119 ns 23666 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222875 ns 225187.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 240875 ns 237479.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 230208 ns 229334 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213854.5 ns 226709 ns 0.94
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 171815 ns 168739 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3916 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3959 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3875 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3917 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23882 ns 23839 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16875 ns 16792 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16708 ns 16833 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16917 ns 16958 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16875 ns 16750 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 163766 ns 161365 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 573417 ns 571458 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 577417 ns 576000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 573666 ns 574041 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 576291 ns 571458 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113520 ns 113559.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1423416 ns 1425375 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1417833 ns 1418875 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1427333.5 ns 1418958 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1419833 ns 1422750 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 212938 ns 210833 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1073958 ns 1076645.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 970834 ns 934291 ns 1.04
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1321166 ns 1340187.5 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1305000 ns 1294270.5 ns 1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA 275497.5 ns 271656 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5791125 ns 5796417 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4570250 ns 4651792 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4945833.5 ns 4918209 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5716292 ns 5515938 ns 1.04
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1090457 ns 1071316.5 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 583 ns 542 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 541 ns 583 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 583 ns 500 ns 1.17
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23889 ns 23948.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2167 ns 2167 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns 2209 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2208 ns 2125 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2084 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 171739 ns 169153 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4500 ns 3625 ns 1.24
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4208 ns 4084 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5000 ns 4687.5 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3625 ns 3709 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 66260 ns 66303.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11625 ns 11270.5 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11750 ns 11417 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12333 ns 11625 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11375 ns 10667 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 453873.5 ns 456550 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7604 ns 6312.5 ns 1.20
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6792 ns 6770.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8250 ns 7792 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5916 ns 7083 ns 0.84
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 53506 ns 52528 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17375 ns 18375 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16958 ns 17833 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18042 ns 17791 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16667 ns 16833 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 305015.5 ns 301396 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 625 ns 0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 541 ns 584 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33251 ns 32972 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9208 ns 9020.5 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8125 ns 8459 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9333 ns 9041 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8542 ns 8708 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 161099 ns 159042.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64458 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64583 ns 64895.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64875 ns 64292 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64583 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112304 ns 110877 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 279208.5 ns 284875 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 284708 ns 297937.5 ns 0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 293834 ns 282333 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 285833 ns 274104.5 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 187730 ns 184904.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3285249.5 ns 3295541 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3021125 ns 2811062.5 ns 1.07
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 2778812.5 ns 3016125 ns 0.92
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3955167 ns 3935209 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 579536 ns 572132 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7633895.5 ns 7478250 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7435145.5 ns 7348937.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7391584 ns 7339479.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8220833 ns 8212959 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1366510.5 ns 1367334 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18924416 ns 18775625 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 13838542 ns 19121334 ns 0.72
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19308875 ns 19108667 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15678458 ns 15653542 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23458500 ns 23560250 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33511625 ns 42472875 ns 0.79
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 41458083 ns 37127771 ns 1.12
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35090271 ns 34865500 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1856909 ns 1862818 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 189151500 ns 188025167 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 164154750 ns 176960479.5 ns 0.93
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 160445604.5 ns 152823708 ns 1.05
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 441242500 ns 441336000 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13896361 ns 13912250 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 289233042 ns 290589750 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 337563458 ns 276449542 ns 1.22
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 307343396 ns 296753875 ns 1.04
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 412183687 ns 333259041 ns 1.24
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24334 ns 22875 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24416.5 ns 23333 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25167 ns 24125 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23917 ns 23542 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 97448 ns 98041.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103042 ns 103625 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 103520.5 ns 135834 ns 0.76
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104750 ns 105084 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103083 ns 103250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 508014 ns 518052 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 6209 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6917 ns 6500 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6833.5 ns 7041.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5750 ns 5959 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68441 ns 70884 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14250 ns 15084 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15875 ns 15708 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16750 ns 16250 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15041 ns 14770.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 473840 ns 492747 ns 0.96
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3011021 ns 3001020.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2060083.5 ns 2085333 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2299209 ns 2274000 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4865083 ns 4550083 ns 1.07
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 586467.5 ns 589071 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23471917 ns 23511750 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18065458 ns 18279542 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18230125 ns 16979209 ns 1.07
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35635667 ns 35598583 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2895396 ns 3111231 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33250875 ns 33266500 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27738541 ns 28064750 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28088500 ns 27365500 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41784729 ns 41824541.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 71875 ns 71750 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 84916 ns 74021 ns 1.15
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75083 ns 74875 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72729 ns 73458 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 104782 ns 104698 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 204229 ns 314125.5 ns 0.65
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 207333.5 ns 212229 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 285666 ns 323000 ns 0.88
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 232875 ns 218042 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 561543 ns 559024 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11708 ns 11625 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12417 ns 12292 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13000 ns 12500 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11750 ns 11875 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 72717 ns 73943 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26958 ns 26583 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27417 ns 26667 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28416 ns 27708 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27229.5 ns 26666 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 481917.5 ns 493150 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12625 ns 12208 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12958 ns 12896 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14000 ns 13916 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12167 ns 12500 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 54368 ns 54608 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26334 ns 26125 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26375 ns 26000 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26375 ns 25916.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27708 ns 26000 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 309823 ns 315887.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182042 ns 179208 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182250 ns 183145.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183604 ns 183166 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 181500 ns 180125 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 57447.5 ns 58575 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 585499.5 ns 582958.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 594896 ns 596541.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 586042 ns 583833 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582125 ns 582834 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 293208 ns 294599.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 6292 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6500 ns 6459 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6854 ns 6750 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6041 ns 6041 ns 1
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 72078 ns 72806 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14541 ns 14542 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14917 ns 13333 ns 1.12
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16125 ns 15667 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14625 ns 14333 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 474091.5 ns 482192.5 ns 0.98
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1281167 ns 1177728.5 ns 1.09
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1208125 ns 1356208.5 ns 0.89
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1280625 ns 1250750 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1322708 ns 1317541 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301459 ns 301448 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4101146 ns 4117688 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4348917 ns 4491417 ns 0.97
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4741500 ns 4696854.5 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4439333 ns 4452542 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1050882 ns 1051206.5 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1917 ns 1875 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23723 ns 24165 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4958 ns 5000 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4958 ns 4958 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4875 ns 4917 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 189158.5 ns 194564.5 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5792 ns 6041 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6333 ns 6000 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7083 ns 6145.5 ns 1.15
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5209 ns 5958 ns 0.87
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 56549.5 ns 57313.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11875 ns 11979.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11041 ns 11854.5 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11895.5 ns 11042 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10229.5 ns 11292 ns 0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 333873.5 ns 342366 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 375 ns 333 ns 1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 334 ns 333 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 334 ns 333 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 334 ns 333 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22961 ns 23004 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3000 ns 3000 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2792 ns 2750 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3083 ns 3000 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2792 ns 2750 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 158273 ns 159207 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11604.5 ns 11583 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11875 ns 11292 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12958 ns 13437.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11208 ns 11708.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 57126.5 ns 57286.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25000 ns 25312.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25167 ns 25083 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25250 ns 25334 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24917 ns 25167 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 295873 ns 296722 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4167 ns 4208 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4250 ns 4167 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24965 ns 25099 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16250 ns 16125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16166 ns 16041 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16125 ns 16166 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16042 ns 16042 ns 1
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 196566.5 ns 199370.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 5833 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5833 ns 5833 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5875 ns 5792 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 5833 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34029 ns 33986 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20917 ns 21083 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21250 ns 21125 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 20709 ns 21208 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21000 ns 20667 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 177138.5 ns 176941.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 397334 ns 396792 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 382166.5 ns 354313 ns 1.08
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 474896 ns 489167 ns 0.97
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 521625 ns 521584 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67116 ns 66831 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 985000 ns 1005417 ns 0.98
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 855645.5 ns 876583 ns 0.98
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1221812.5 ns 1235667 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1321458.5 ns 1420854 ns 0.93
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 190549 ns 191762.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80250 ns 80250 ns 1
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80625 ns 80209 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 81833 ns 84167 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80458.5 ns 81125 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192943 ns 193433 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1924750 ns 1916083 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1908104 ns 1933854 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1708583 ns 1917917 ns 0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1920583.5 ns 1923708.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 399820 ns 409629 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22084 ns 22197 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1792 ns 1875 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 169250 ns 170854.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6083 ns 6791 ns 0.90
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7084 ns 6417 ns 1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7937.5 ns 7375 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6333 ns 6959 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60491 ns 61202 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9375 ns 9291.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9500 ns 9166.5 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9541 ns 9375 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9042 ns 9334 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 310109.5 ns 313492.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 118286646 ns 120748834 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174064625 ns 181703729 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 156462125 ns 148437750 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106578500 ns 104851584 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5473382.5 ns 5474996 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 614571333 ns 616853125 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 555423417 ns 579539270.5 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 470293000 ns 451846854.5 ns 1.04
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 760588104.5 ns 757165312.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35131353 ns 34944567 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 650061834 ns 649889209 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 665653562.5 ns 688661771 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 595687063 ns 592710229 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 744750500 ns 741917708 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59167 ns 59750 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46750 ns 38959 ns 1.20
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38750 ns 48000 ns 0.81
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83459 ns 83416 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37444 ns 37459 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1930479 ns 1922792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1983354 ns 1985083 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1816458.5 ns 1978104 ns 0.92
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1895083 ns 1893917 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 175426.5 ns 174160 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 285562 ns 290625 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 280875 ns 266708 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 268125 ns 271521 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 268209 ns 268167 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 130965 ns 132776.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 592333 ns 657229.5 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 692458 ns 681187.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 693521 ns 691583 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 601042 ns 597417 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 701811.5 ns 713916 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2222375.5 ns 2243937 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2219833 ns 2191895.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2137416 ns 2213542 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2174875 ns 2180437.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133582 ns 133381 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5523042 ns 5496875 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5512333 ns 5583292 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5594937 ns 5498250 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5490208 ns 5492750.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 738932 ns 753967 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 642209 ns 636833 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 637875 ns 644417 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 637625 ns 645333 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 639000 ns 637292 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47189.5 ns 46993.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1826916 ns 1826042 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1718042 ns 1667083 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1663958 ns 1726542 ns 0.96
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2104375 ns 2105854.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 220556 ns 222295 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58875 ns 58500 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 45167 ns 38708 ns 1.17
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38500 ns 47250 ns 0.81
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83584 ns 84292 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28746 ns 28598 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027229 ns 2031041 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2081437.5 ns 2099020.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2087875 ns 2091916.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1994833 ns 1856417 ns 1.07
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 190232.5 ns 190652 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13337625 ns 13391395.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12443750 ns 12453250 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12593792 ns 12557375.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15201937 ns 15140541 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 517591 ns 514312 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47226542 ns 47481750 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41816250 ns 41986250 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41295625 ns 40944792 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58300709 ns 57945917 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3048320 ns 3259544 ns 0.94
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 97261354 ns 96867229.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 68631000 ns 91436187.5 ns 0.75
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 91621375 ns 90591917 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 99015250 ns 76381625 ns 1.30
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59125 ns 59083.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47250 ns 38750 ns 1.22
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38500 ns 47417 ns 0.81
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81750 ns 84000 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47183 ns 46955 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1926041 ns 1925125 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1963020.5 ns 1979250 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1903354.5 ns 1970729.5 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1890791.5 ns 1897750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 192410.5 ns 191790.5 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32519 ns 32566 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6417 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6250 ns 6458 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6520.5 ns 6459 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6209 ns 6083 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 172802 ns 174123.5 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31540 ns 31409 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2834 ns 2833 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2667 ns 2791 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2916 ns 2834 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2625 ns 2583 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 160546.5 ns 161269 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 285419125.5 ns 286258979.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 340019750 ns 346927270.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 321915833.5 ns 313997291.5 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 276611333 ns 270108416 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7104026 ns 7104986 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 983693667 ns 998016667 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 937864666 ns 959348209 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 869633271 ns 851652541.5 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1164063416 ns 1162498166 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33938699 ns 33999768 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1680134083 ns 1672427541 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1330764145.5 ns 1705785000 ns 0.78
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1621211875 ns 1631619209 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1681717458 ns 1314128542 ns 1.28
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1405437.5 ns 1406813 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1406833 ns 1416875 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1410083 ns 1459625 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1408667 ns 1407750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127501 ns 127789 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5022916.5 ns 5022896 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5014916.5 ns 5051333 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5074667 ns 5029542 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5020083.5 ns 5031875 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 549193 ns 559312.5 ns 0.98
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 174892687.5 ns 169600250 ns 1.03
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 127003959 ns 180340396 ns 0.70
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 149797333.5 ns 130036124.5 ns 1.15
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 155817792 ns 169790708.5 ns 0.92
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4879989.5 ns 5056885.5 ns 0.97
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 621407125 ns 669854958 ns 0.93
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 648841042 ns 604244667 ns 1.07
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 575120333 ns 501867209 ns 1.15
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 681424083 ns 684062709 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 15898555 ns 16520518 ns 0.96
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8876291.5 ns 8950666 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8737625 ns 8876958.5 ns 0.98
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 8213750.5 ns 7849458.5 ns 1.05
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10160729 ns 10185417 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1593806 ns 1594436 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 35769916 ns 36026541.5 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 36822375.5 ns 38047792 ns 0.97
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 34657916 ns 33343417 ns 1.04
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38863625 ns 38792000 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6473421 ns 6457988 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47500 ns 47417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47250 ns 47375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47583 ns 47584 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47292 ns 47333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18826 ns 18535 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50291 ns 50291 ns 1
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50458 ns 50375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50854.5 ns 50417 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50333 ns 50083 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 198991.5 ns 191873 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7083 ns 6458 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7584 ns 6917 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7917 ns 7750 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6458 ns 6958 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 97827 ns 91345 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10416 ns 10458 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10292 ns 9916 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9958 ns 10084 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10042 ns 10208 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 550910 ns 527140.5 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5812.5 ns 5625 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6333 ns 5917 ns 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8125 ns 6958 ns 1.17
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5541 ns 5750 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 105892 ns 120543 ns 0.88
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13792 ns 13583 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13042 ns 13354.5 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13584 ns 13458 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13375 ns 13000 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 499431 ns 537999 ns 0.93
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1125 ns 1083 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1125 ns 1042 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 33256 ns 32473 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8333 ns 7917 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7958 ns 7917 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8167 ns 7959 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8084 ns 8167 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 209965 ns 206314.5 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23125 ns 23437.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23083 ns 23167 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23250 ns 23584 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23167 ns 23542 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18915 ns 18671 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52625 ns 52458 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52584 ns 52541 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 53000 ns 53458 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52854.5 ns 52062.5 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 277479.5 ns 291832.5 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1458417 ns 1458937 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1400208.5 ns 1401583 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1405167 ns 1403833.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1426958.5 ns 1459708.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196215 ns 195968 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5014875 ns 5008771 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4947750 ns 5044104 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4722875 ns 5017250 ns 0.94
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5007813 ns 5011916 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 607506 ns 599687 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3062917 ns 3061000 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2059667 ns 2086750 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2288458.5 ns 2304917 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4925875 ns 4539041 ns 1.09
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 577404 ns 581670 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24414833.5 ns 24376958 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18846271 ns 19122667 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19151917 ns 19181062.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36770687.5 ns 36163041 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2986349 ns 3185287.5 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34017271 ns 34039875 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28311000 ns 28717291.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28446000 ns 28156000 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41721770.5 ns 41614584 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144322334 ns 144831583 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 142519708 ns 143542708 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 125442687.5 ns 124983229.5 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173816042 ns 173618479 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22556374 ns 22558463 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1298683979 ns 1247182979 ns 1.04
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 886917541 ns 836595146 ns 1.06
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 716672958 ns 738893583 ns 0.97
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 736020312.5 ns 672803125 ns 1.09
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 117170580 ns 118329511 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 83729 ns 84666 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74583.5 ns 73666 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76584 ns 76146 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74125 ns 75688 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 239252.5 ns 240753.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 290521 ns 287042 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 277645.5 ns 212354 ns 1.31
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 200875 ns 296854 ns 0.68
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 288958.5 ns 284250 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1205034 ns 1238105 ns 0.97
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35458291 ns 35497979 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35474646 ns 35870917 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32477292 ns 32110833 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40963750 ns 40961896 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5849344 ns 5843453.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148280375 ns 149169500 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 153499458.5 ns 155980437.5 ns 0.98
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 141027625 ns 134845625 ns 1.05
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 288738292 ns 287434667 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34899034 ns 34879809 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121978521.5 ns 121767709 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174111417 ns 181613625 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 157126854 ns 148039291 ns 1.06
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 104267125.5 ns 104612333.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5468756 ns 5485164 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 470236083 ns 472118833 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 467564916 ns 486130458.5 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 457529645.5 ns 440650208 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 742702979.5 ns 746192375 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32271141 ns 32245076 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 708676479.5 ns 643396416 ns 1.10
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 654437437.5 ns 675303249.5 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 584295500 ns 575492166 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 854241125 ns 856961334 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1340625 ns 1312541 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 955312.5 ns 677667 ns 1.41
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 751500 ns 963459 ns 0.78
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2049958 ns 2093375 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 573022 ns 580070.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2963229 ns 2966541.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2574541 ns 2496854 ns 1.03
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2393666.5 ns 2623959 ns 0.91
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3690375 ns 3704083 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1694080 ns 1730505 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6647083 ns 6656375 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6495583 ns 6477624.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6493333 ns 6431167 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4435562.5 ns 4450479.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7458 ns 7375 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5959 ns 5417 ns 1.10
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5250 ns 6084 ns 0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 9917 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25179 ns 25252 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212417 ns 212583 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219708 ns 229770.5 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221958 ns 220500 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206542 ns 206083 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 241960 ns 251646.5 ns 0.96
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 312753979 ns 301644020.5 ns 1.04
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 215397229 ns 280942354.5 ns 0.77
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 218945999.5 ns 189363792 ns 1.16
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 313831042 ns 305392479 ns 1.03
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7904498 ns 7676597 ns 1.03
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1087078708 ns 1087372208.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 909298979 ns 980974208 ns 0.93
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 871782083.5 ns 865965209 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1162821625 ns 1158600916.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 27057892 ns 26533591 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5687.5 ns 5354.5 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5458 ns 5375 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7084 ns 6917 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5417 ns 4958 ns 1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 133527.5 ns 146657 ns 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7395.5 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7084 ns 7375 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7250 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7584 ns 7250 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 563771.5 ns 596011.5 ns 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 666 ns 625 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 625 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23871 ns 24031 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9333 ns 8917 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9291 ns 9708 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9292 ns 9583 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8667 ns 8833 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 208765 ns 216620.5 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 352083 ns 353333 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 352083 ns 352041 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 353791.5 ns 352666.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 351833.5 ns 352417 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21220 ns 21463 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 824895.5 ns 820625 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 777334 ns 828917 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 831875 ns 774875 ns 1.07
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 824062.5 ns 778729 ns 1.06
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 254482.5 ns 269469 ns 0.94
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 332771 ns 337187.5 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 342541 ns 313687.5 ns 1.09
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 441895.5 ns 444709 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 330125 ns 334500 ns 0.99
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18025 ns 17922 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 690541 ns 689958 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 744041.5 ns 746333 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1045312.5 ns 1025042 ns 1.02
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 688708 ns 694854.5 ns 0.99
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 229447 ns 242950 ns 0.94
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 347291 ns 351417 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 349333.5 ns 327270.5 ns 1.07
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 436667 ns 414729.5 ns 1.05
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 371000 ns 371750 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22527 ns 22559 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 750021 ns 747208 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 746646 ns 749416 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1076187.5 ns 1069374.5 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 818562.5 ns 815937.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 208987 ns 224503 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3792 ns 3708 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3542 ns 3625 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3583 ns 3291 ns 1.09
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17734 ns 17855 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4459 ns 4208 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4625 ns 4208 ns 1.10
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4250 ns 4333 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4250 ns 4208 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 233318 ns 248489.5 ns 0.94
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4042 ns 3708 ns 1.09
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3958 ns 4167 ns 0.95
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4208 ns 4791 ns 0.88
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3667 ns 3792 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 176017 ns 203806 ns 0.86
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8792 ns 8667 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8541 ns 8250 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8750 ns 8458 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8667 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1103718 ns 1166315.5 ns 0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204375 ns 204875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209958 ns 209750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209791 ns 209834 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199042 ns 200000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34881 ns 34893 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 647417 ns 602917 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 667104.5 ns 628833 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 648250 ns 621584 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630437.5 ns 592041 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 347074.5 ns 321942.5 ns 1.08
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 963666 ns 978791 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 935292 ns 937250.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 967916 ns 960250 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1291958 ns 1307271 ns 0.99
batchedmm(128, Bsize=128)/forward/GPU/CUDA 208614 ns 207418 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4493917 ns 4504084 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4486500 ns 4619604.5 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4473646 ns 4294917 ns 1.04
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6242041.5 ns 6229292 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 938155.5 ns 936037 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3125 ns 3354 ns 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3500 ns 3583 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4375 ns 4417 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3000 ns 3333 ns 0.90
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 231776.5 ns 196464 ns 1.18
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7708 ns 7334 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns 7417 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7583 ns 7291 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7333 ns 6917 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1010539 ns 985634 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1636750 ns 1640792 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1187333 ns 1171541.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1380354 ns 1327125 ns 1.04
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2361520.5 ns 2384666 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214275 ns 216205.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12321562.5 ns 12345499.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9589792 ns 9603042 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9486250 ns 9259895.5 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18012208.5 ns 18032958.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1948219 ns 1950941 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17356062.5 ns 17348083 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14365000 ns 14444583.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14592667 ns 14302167 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21122541 ns 21057645.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 88041 ns 87666.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 89542 ns 89562 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 98792 ns 90292 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 133292 ns 88875 ns 1.50
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126258 ns 126565 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2030917 ns 2024000 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2007375 ns 2030958.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2054917 ns 1707583 ns 1.20
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2027125 ns 2030042 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1023675.5 ns 999913 ns 1.02
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 339083.5 ns 343750 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 346896 ns 326145.5 ns 1.06
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 419438 ns 396833 ns 1.06
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 306854 ns 309896 ns 0.99
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16039.5 ns 16654 ns 0.96
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 704834 ns 702666 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 734708 ns 733666 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1027250 ns 1020166 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 642979 ns 652500 ns 0.99
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 190568 ns 190386.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7416 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 5291 ns 1.13
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5250 ns 6000 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10041 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34186 ns 34743 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 224958 ns 224334 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220812.5 ns 229333 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228020.5 ns 220959 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218645.5 ns 206292 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 310606 ns 296926 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3792 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3667 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22651 ns 23083 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14459 ns 14416 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14416 ns 14209 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14208 ns 14292 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14500 ns 14458 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 462091 ns 448235 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 137521 ns 92854 ns 1.48
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 92958 ns 99583 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 98583 ns 94542 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 135083 ns 96042 ns 1.41
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125583 ns 125978 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1927479 ns 1920562.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1907708.5 ns 1914937.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1953208 ns 1653792 ns 1.18
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1921667 ns 1928541 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 948359 ns 893203 ns 1.06
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 864916 ns 878750 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 824896 ns 800021 ns 1.03
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1172959 ns 1221729 ns 0.96
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 959583 ns 963792 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 277617.5 ns 277692.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2822000 ns 2824834 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2517479 ns 2464958 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3358584 ns 3323271 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3372271 ns 3398958 ns 0.99
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1588161 ns 1565101.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17708 ns 17667 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17729 ns 15458.5 ns 1.15
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18125 ns 17250.5 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16770.5 ns 14645.5 ns 1.15
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 145282 ns 142432.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 226562.5 ns 218209 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 264687 ns 222958.5 ns 1.19
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 224583.5 ns 216334 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 260834 ns 215062.5 ns 1.21
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 640324 ns 637432 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 220083.5 ns 221145.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 222125 ns 222375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222541.5 ns 220917 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 223209 ns 220333 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 272584.5 ns 280530 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 557208.5 ns 510354 ns 1.09
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 495917 ns 499375 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 508083 ns 500021 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 523021 ns 507041 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1311116 ns 1281236 ns 1.02
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 324084 ns 332250 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 333875 ns 316000 ns 1.06
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 441292 ns 364333 ns 1.21
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 320124.5 ns 323834 ns 0.99
batchedmm(16, Bsize=4)/forward/GPU/CUDA 17097 ns 17441 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 707729 ns 715833.5 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 729728.5 ns 735083 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1025875 ns 1022959 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 661375 ns 667041 ns 0.99
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 194989.5 ns 193588.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18250 ns 18666 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19479 ns 17375 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19687.5 ns 19167 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19354 ns 17083.5 ns 1.13
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 149984.5 ns 147781 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221000 ns 212542 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 250875 ns 214146 ns 1.17
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 253604 ns 213834 ns 1.19
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 239125.5 ns 211354.5 ns 1.13
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 972417 ns 877964 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4000 ns 4083 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4583 ns 4291.5 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5417 ns 5375 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4334 ns 3958 ns 1.09
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 242942 ns 169898 ns 1.43
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11000 ns 10834 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10458 ns 10542 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11042 ns 10583 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10500 ns 10459 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1104793 ns 993411.5 ns 1.11
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3000 ns 3417 ns 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3667 ns 3167 ns 1.16
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4166.5 ns 4375 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3375 ns 3062.5 ns 1.10
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 245420.5 ns 203556.5 ns 1.21
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7708 ns 7791 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7584 ns 7458 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns 7250 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7417 ns 7541 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1110479 ns 1041955 ns 1.07
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23377292 ns 23557729 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34613667 ns 43140979 ns 0.80
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 41416917 ns 37880833 ns 1.09
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34914458 ns 34954917 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1848549 ns 1859678 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184280166 ns 184630708 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159478458 ns 172192624.5 ns 0.93
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152113791 ns 146314396 ns 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 414042208 ns 415449708 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16517300 ns 16494786 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 425300833 ns 428781042 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 256639521 ns 259710791 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 234680125.5 ns 231751208 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 486551125 ns 484878833 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 184291 ns 183625 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 184583 ns 183375 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184625 ns 184417 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 184312.5 ns 182667 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 230607.5 ns 177771.5 ns 1.30
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 631437.5 ns 590604 ns 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 637750 ns 588083 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 639709 ns 586792 ns 1.09
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 638791 ns 586958 ns 1.09
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1098424 ns 1015783.5 ns 1.08
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3847292 ns 3860917 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3628167 ns 3732375 ns 0.97
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3558812.5 ns 3478062.5 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5353375 ns 5358854.5 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 532230.5 ns 533317.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17384021 ns 17452375 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17275646 ns 17779209 ns 0.97
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 17134208 ns 16551750 ns 1.04
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 22096708 ns 22184000 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2624383 ns 2614491.5 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32858 ns 32765 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9542 ns 9625 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9708 ns 9542 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9625 ns 9625 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9250 ns 8917 ns 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 268928.5 ns 263711.5 ns 1.02
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 496244541 ns 501494042 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 428631687.5 ns 411555459 ns 1.04
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 471025146 ns 374781084 ns 1.26
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 672608229 ns 672198042 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12480226 ns 12477100 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2045534291.5 ns 2044775145.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1629126042 ns 1660536667 ns 0.98
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1545195479.5 ns 1495631604 ns 1.03
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2203923687.5 ns 2221523375 ns 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49226660 ns 49258137.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1635291.5 ns 1643291 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1186250 ns 1172917 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1395042 ns 1391041.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2479729 ns 2338333 ns 1.06
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 217632.5 ns 215612.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12692396 ns 12698542 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9939208 ns 9998999.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9850125 ns 9717041 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18465125 ns 18433792 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2037012 ns 2039696 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17664667 ns 17679687.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14690937.5 ns 14770854.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14825583.5 ns 14602583.5 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21424812.5 ns 21327625 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26334 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26209 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26292 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26209 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24221 ns 24225 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67209 ns 67250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66709 ns 66834 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67000 ns 68166 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66917 ns 66792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 406228 ns 378162.5 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204083 ns 203125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209542 ns 208500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209375 ns 208666 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199750 ns 200125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26259 ns 26005 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 613416 ns 646625 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 667562.5 ns 628813 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 632875 ns 669895.5 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 631229 ns 580791.5 ns 1.09
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 355484.5 ns 311381 ns 1.14
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 657792 ns 651667 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 610750 ns 638666 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 664458 ns 647417 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 646729 ns 653083.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132238 ns 131397 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2245583.5 ns 2243375 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2218750 ns 2314937.5 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2323646 ns 2249625 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2243666 ns 2235375 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1254388 ns 1114755 ns 1.13
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17334 ns 18291 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18979 ns 17500 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19562.5 ns 20917 ns 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17875 ns 18292 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 146014 ns 143094 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 230583.5 ns 223500 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 233541 ns 226042 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228625 ns 262917 ns 0.87
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 231208 ns 230125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1062578.5 ns 943015 ns 1.13
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 708 ns 625 ns 1.13
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 541 ns 625 ns 0.87
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 666 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 584 ns 583 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23616 ns 23380 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10458 ns 10104.5 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9750 ns 10166 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10000 ns 10000 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10167 ns 9583 ns 1.06
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 261064 ns 254915.5 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5334 ns 5084 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5458 ns 5375 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6854.5 ns 6791 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5375 ns 5250 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 236061.5 ns 190346.5 ns 1.24
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7541 ns 7250 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 7125 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7792 ns 7250 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7291 ns 7083 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 805015.5 ns 735734 ns 1.09
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2209 ns 2167 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2208 ns 2208 ns 1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2292 ns 2209 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2250 ns 2417 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18184 ns 18111 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6584 ns 6750 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6584 ns 6375 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6645.5 ns 6625 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6750 ns 6625 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 332908 ns 306022.5 ns 1.09
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 749687 ns 751583.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746583 ns 748875 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 748687.5 ns 746812.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 749166.5 ns 748500 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21336 ns 21064 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 793250 ns 791834 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 776437.5 ns 788667 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 791542 ns 786646.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 792167 ns 792479 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 297866.5 ns 294710 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7166 ns 7417 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5917 ns 5208 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5334 ns 6000 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10084 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33133 ns 33108.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 260542 ns 228645.5 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 270770.5 ns 231416 ns 1.17
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 236334 ns 271625 ns 0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 257584 ns 225958 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 360496.5 ns 351410 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9625 ns 10292 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10125 ns 10084 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11083 ns 11166 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10791 ns 10000 ns 1.08
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 247606.5 ns 209596.5 ns 1.18
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24583 ns 24709 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24666 ns 24333 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25209 ns 24291 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24812.5 ns 24437.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1126313 ns 1037550 ns 1.09
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106238125 ns 107199542 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 118978937.5 ns 126347334 ns 0.94
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 124037917 ns 120468625 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117504750 ns 117762042 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2653510 ns 2637816 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 392967833 ns 393813416 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 366459917 ns 380007916 ns 0.96
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 362566708.5 ns 355873375 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 484255125 ns 484550250 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15300329 ns 15152772.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 937284708 ns 939763875 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 591951208 ns 777743792 ns 0.76
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 756812791 ns 745742833 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 951623104 ns 767071771.5 ns 1.24
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7167 ns 7167 ns 1
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7375 ns 6833 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8542 ns 8458 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7084 ns 7562.5 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 237393.5 ns 228024 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14625 ns 14250 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14375 ns 14042 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14875 ns 13875 ns 1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14292 ns 13333 ns 1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1084284.5 ns 1000779 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6083 ns 6167 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6000 ns 6125 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7458 ns 8250 ns 0.90
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5959 ns 5604.5 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 233786.5 ns 214266.5 ns 1.09
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13542 ns 12417 ns 1.09
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12583 ns 12542 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13042 ns 12875 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12958 ns 12541 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 788004.5 ns 724930 ns 1.09
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 340292 ns 349208 ns 0.97
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 345625 ns 326145.5 ns 1.06
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 426020.5 ns 393333 ns 1.08
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 310917 ns 314271 ns 0.99
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16775 ns 17228 ns 0.97
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 703458.5 ns 706500 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 735146 ns 739437.5 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1029625 ns 1020354 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 653041.5 ns 658541 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 198474 ns 198297 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23667 ns 23935.5 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6833 ns 6500 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6333 ns 6584 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6541 ns 6584 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6458 ns 6250 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 239358 ns 240134 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5958 ns 5875 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5917 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5958 ns 5917 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 5834 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24385 ns 24721 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 22000 ns 21500 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21083 ns 21333 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21917 ns 21292 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21375 ns 21208 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 262229.5 ns 262379.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144104.5 ns 144229.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144708 ns 144042 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 156875 ns 147292 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147646 ns 145833 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167190 ns 167351 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1330708 ns 1320395.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1308979 ns 1358771 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1381125 ns 1324084 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1321999.5 ns 1329333.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1349304 ns 1268788 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24041 ns 24083 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22166 ns 22375 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25083 ns 25104.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23375 ns 21917 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 354283 ns 280502 ns 1.26
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 180791 ns 131646 ns 1.37
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 129000 ns 121334 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 185792 ns 177687.5 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 131125 ns 130209 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1467747 ns 1380349 ns 1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 416 ns 416 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23032 ns 23199 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6792 ns 6708 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 7083 ns 0.91
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6708 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6708 ns 6083 ns 1.10
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 255727 ns 258254.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4166 ns 5042 ns 0.83
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4542 ns 4500 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5208.5 ns 4917 ns 1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5083 ns 4917 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 256155 ns 243109 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10542 ns 10375 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10416 ns 10042 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10583 ns 10125 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10500 ns 10167 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1350723 ns 1338362 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1667 ns 1667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1542 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22939 ns 23629 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6000 ns 5875 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5667 ns 5666 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6000 ns 5958 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5625 ns 5625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 272831 ns 278503 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6857937.5 ns 6825854.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6411896 ns 6429125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6553167 ns 6541187.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7666479 ns 7656375 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213835 ns 215102 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24054084 ns 24080834 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21280000 ns 21338208 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21185791.5 ns 21079333 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29741959 ns 29660375 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2122169 ns 2111008 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48688792 ns 48564000 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 34555896 ns 45595770.5 ns 0.76
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 46213562.5 ns 45721854 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49348041.5 ns 38038271 ns 1.30
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6125 ns 5687.5 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6583 ns 6041 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7375 ns 6917 ns 1.07
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5792 ns 5375 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 235412.5 ns 239823 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9208 ns 8291 ns 1.11
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8166 ns 8500 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8250 ns 8750 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 8750 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1059126.5 ns 1069933 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1547792 ns 1555021 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1268875 ns 1235375.5 ns 1.03
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1642125 ns 1618375 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2104375 ns 2095209 ns 1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA 272083 ns 285020 ns 0.95
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7919145.5 ns 7898542 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6557000 ns 6630645.5 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7291375 ns 7200958 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10450458 ns 10372854.5 ns 1.01
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1842888 ns 1904820 ns 0.97
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 337250.5 ns 342000 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 349333 ns 323833 ns 1.08
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 435459 ns 382208 ns 1.14
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 338979.5 ns 342042 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46473 ns 43080 ns 1.08
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 749250 ns 725958 ns 1.03
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 785041.5 ns 782938 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1073959 ns 1067750 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 758166 ns 737041.5 ns 1.03
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 310254 ns 314201.5 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397250 ns 397583 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287958 ns 211916 ns 1.36
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 212208 ns 288208 ns 0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 751500 ns 750834 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44386.5 ns 44587.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 665041 ns 670500 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 531625 ns 470708 ns 1.13
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 475375 ns 531792 ns 0.89
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973541 ns 974083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 190915 ns 192970 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 675770.5 ns 651646 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 542104.5 ns 644458.5 ns 0.84
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 678062.5 ns 659271 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 664542 ns 645333 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132463.5 ns 132814 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2463958 ns 2440750 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2450646 ns 2525916.5 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2558312.5 ns 2439124.5 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2453333 ns 2464750 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1360044 ns 1349058.5 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 341187.5 ns 344292 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 346250.5 ns 326104 ns 1.06
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 423125 ns 393875 ns 1.07
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 308083 ns 312896 ns 0.98
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16649 ns 16925 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 701917 ns 709938 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 730292 ns 739917 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1026459 ns 1021708 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 645479.5 ns 650083.5 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 200069 ns 202873.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1462750 ns 1458625 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1503917 ns 1490666 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1495500 ns 1498417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1440500 ns 1436416 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 41326 ns 41016 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5124542 ns 5105458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5186000 ns 5294583 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5326708 ns 5292167 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4970396 ns 5007208 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 198810.5 ns 201135.5 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33150 ns 33479.5 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15458 ns 15292 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15083 ns 15125 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15167 ns 15291 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15167 ns 15042 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 380554 ns 381756.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 70833 ns 71209 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 70958 ns 71250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71708 ns 71125 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 70959 ns 70062.5 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113758.5 ns 114111 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 317833 ns 318250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 332729.5 ns 329625 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 328583 ns 318708 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 317875 ns 317958 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 195020 ns 197229.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1125 ns 1083 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1000 ns 1083 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 1000 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 24076 ns 24163 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8500 ns 8167 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8125 ns 8041 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8459 ns 8667 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7958 ns 7625 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 263092.5 ns 264271.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 465833.5 ns 464166.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 464583 ns 448167 ns 1.04
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 558958 ns 553459 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 545792 ns 548917 ns 0.99
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130415.5 ns 129241.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1385791 ns 1380229 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1372375 ns 1393229 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1649167 ns 1619541 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1576791.5 ns 1590270.5 ns 0.99
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 275691.5 ns 277974 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 333 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32361 ns 32417 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6375 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6417 ns 6500 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6541 ns 6542 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 5958 ns 5958 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 266723 ns 267135 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1725542 ns 1723834 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1722333 ns 1731042 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1740249.5 ns 1722458 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1775209 ns 1727375 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169600.5 ns 168945.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4369709 ns 4366646 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4361042 ns 4396958.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4418959 ns 4374416.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4359417 ns 4349500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1195476 ns 1192401 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6709 ns 6750 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6667 ns 6541 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7708 ns 7292 ns 1.06
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6958 ns 6542 ns 1.06
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 19965 ns 20406 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 67750 ns 81771 ns 0.83
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 32833 ns 49083 ns 0.67
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 69833.5 ns 72271 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 51750 ns 51334 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 210322.5 ns 213340.5 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 348792 ns 354167 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 347104 ns 329541.5 ns 1.05
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 438584 ns 401083 ns 1.09
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 319333 ns 321771 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18398 ns 18865 ns 0.98
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 722167 ns 722646.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 735291 ns 740500 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1042208 ns 1030625 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 674229 ns 673875 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 347323.5 ns 350549.5 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 74875 ns 75250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75208 ns 75250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75875 ns 75458 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75083 ns 75042 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47177 ns 47823 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324167 ns 324625 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 326833 ns 341667 ns 0.96
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 334375 ns 324250 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 324000 ns 330833 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 210646 ns 216202 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1484084 ns 1485500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1529667 ns 1517334 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1521625 ns 1526000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1466416 ns 1463167 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52455 ns 53576 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5102666 ns 5124354.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5285791 ns 5278542 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5324479 ns 5287917 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4984666 ns 4986958 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 207886 ns 209445 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28292 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28291 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28208 ns 28208 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28292 ns 28291 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24996 ns 25452 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66541 ns 66333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66250 ns 66250 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66250 ns 66250 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66667 ns 66333 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 535846 ns 539628 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1500625 ns 1483687.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1152708 ns 859791.5 ns 1.34
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 946334 ns 1143208 ns 0.83
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2247854 ns 2247229.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 585255 ns 585407 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3068333.5 ns 3085000 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2645417 ns 2591208 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2631750 ns 2737895.5 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3815708.5 ns 3816250 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2083950 ns 2035890 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8841959 ns 8818187.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8418083 ns 8953500 ns 0.94
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8767584 ns 8776854 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6380750 ns 6365041 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 133083 ns 80791 ns 1.65
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81375 ns 79875 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82520.5 ns 82792 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80666 ns 80708 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193555.5 ns 194256.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020770.5 ns 2013375 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1954167 ns 1748958 ns 1.12
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2050958 ns 2018500 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2013083 ns 2022750 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 805558 ns 809328 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
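
For readers checking the table by hand, the Ratio column appears to be the current timing divided by the previous one, so values above 1 mean the current commit was slower. The sketch below recomputes it from a few rows copied from the table. The helper names `parse_row` and `flag_slowdowns` and the 1.25 cutoff are illustrative assumptions, not part of github-action-benchmark or its configuration.

```python
# Hypothetical helpers for re-deriving the Ratio column; not part of any tool.

def parse_row(line):
    """Split a flattened row into (name, current_ns, previous_ns, reported_ratio)."""
    # The benchmark name contains spaces, so split from the right:
    # name, current, "ns", previous, "ns", ratio
    name, current, _, previous, _, reported = line.rsplit(" ", 5)
    return name, float(current), float(previous), float(reported)


def flag_slowdowns(lines, threshold=1.25):
    """Return (name, ratio) for rows whose current/previous ratio exceeds threshold.

    The threshold is an assumed cutoff chosen only for illustration.
    """
    flagged = []
    for line in lines:
        name, current, previous, _ = parse_row(line)
        ratio = current / previous
        if ratio > threshold:
            flagged.append((name, round(ratio, 2)))
    return flagged


# Rows copied verbatim from the table above.
rows = [
    "mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1152708 ns 859791.5 ns 1.34",
    "mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 946334 ns 1143208 ns 0.83",
    "groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 133083 ns 80791 ns 1.65",
    "dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28208 ns 28208 ns 1",
]

for name, ratio in flag_slowdowns(rows):
    print(f"{ratio:.2f}  {name}")
```

A single-run ratio such as 1.65 next to 0.83 for the same model at a different thread count is probably noise rather than a real regression, which is one reason to treat any fixed cutoff with caution.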
