HF Transformers ViT slower than torch.compile and raw pytorch #1502

Open
2catycm opened this issue Dec 2, 2024 · 2 comments
2catycm commented Dec 2, 2024

README + toy example

The first example is the one from the README.md:

import torch
import thunder


def foo(a, b):
    return a + b


jfoo = thunder.jit(foo)

a = torch.full((2, 2), 1)
b = torch.full((2, 2), 3)

result = jfoo(a, b)
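
For what it's worth, the jitted function can be sanity-checked against the eager one before timing (not part of the README snippet, just a quick check):

torch.testing.assert_close(jfoo(a, b), foo(a, b))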

I tried

cfoo = torch.compile(foo)
result = cfoo(a, b)

%timeit foo(a, b)
%timeit cfoo(a, b)
%timeit jfoo(a, b)

and got

2.81 μs ± 42.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
27.9 μs ± 2.32 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
106 μs ± 2.15 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

It seems raw PyTorch is the fastest here.
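
Note that this toy benchmark adds two tiny int64 CPU tensors, so it mostly measures per-call overhead rather than kernel time. A sketch of what might be a fairer comparison, with larger float32 CUDA tensors (sizes are arbitrary, I have not included numbers for it, and both compiled functions should re-trace/recompile once for the new inputs):

# larger CUDA inputs so the add kernel dominates the per-call overhead
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# warm up so the re-trace / recompilation for the new shapes, dtype and device is not timed
jfoo(a, b)
cfoo(a, b)

%timeit foo(a, b)
%timeit cfoo(a, b)
%timeit jfoo(a, b)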

The forward trace from Thunder is:

forward_trace = thunder.last_traces(jfoo)[-1].python()
print(forward_trace)
# Constructed by Unwrap the actual return value
import torch
from thunder.executors.torchex import no_autocast

@torch.no_grad()
@no_autocast
def computation(a, b):
  # a: "cpu i64[2, 2]"
  # b: "cpu i64[2, 2]"
  t0 = torch.add(a, b, alpha=1)  # t0: "cpu i64[2, 2]"
  return t0

ViT Example

The toy example may be too simple, so the per-call overhead dominates; therefore I tried a more practical example.

from transformers import AutoModel

model = AutoModel.from_pretrained("WinKawaks/vit-tiny-patch16-224").cuda()

jmodel = thunder.jit(model)
jmodel(torch.randn(10, 3, 224, 224).cuda())  # first call: compilation / warm-up

cmodel = torch.compile(model)
cmodel(torch.randn(10, 3, 224, 224).cuda())  # first call: compilation / warm-up

%timeit model(torch.randn(10, 3, 224, 224).cuda())
%timeit cmodel(torch.randn(10, 3, 224, 224).cuda())
%timeit jmodel(torch.randn(10, 3, 224, 224).cuda())

and I got

25.8 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
20.1 ms ± 254 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
35 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Compilation is also very slow: the first run of jmodel took 1m42.8s, while the first run of the torch.compile version took only 23.5s.
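
As a side note, the %timeit lines above allocate a fresh input on the host and copy it to the GPU on every iteration, and they rely on %timeit's own handling of asynchronous CUDA work. A more controlled measurement (sketch only, I have not re-run the numbers this way) might pre-allocate the input and use torch.utils.benchmark, which takes care of CUDA synchronization:

from torch.utils import benchmark

x = torch.randn(10, 3, 224, 224, device="cuda")  # pre-allocated input, reused for every run

for label, fn in [("eager", model), ("torch.compile", cmodel), ("thunder.jit", jmodel)]:
    t = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x})
    print(label, t.timeit(50))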

The Thunder trace for the ViT model is:

# Constructed by Delete Last Used (took 10 milliseconds)
import torch
import torch.nn.functional
from thunder.executors.torchex import no_autocast

@torch.no_grad()
@no_autocast
def computation(pixel_values, t_embeddings_cls_token, bias, weight, t_embeddings_position_embeddings, t_encoder_layer_0_attention_attention_key_bias, t_encoder_layer_0_attention_attention_key_weight, t_encoder_layer_0_attention_attention_query_bias, t_encoder_layer_0_attention_attention_query_weight, t_encoder_layer_0_attention_attention_value_bias, t_encoder_layer_0_attention_attention_value_weight, t_encoder_layer_0_attention_output_dense_bias, t_encoder_layer_0_attention_output_dense_weight, t_encoder_layer_0_intermediate_dense_bias, t_encoder_layer_0_intermediate_dense_weight, t_encoder_layer_0_layernorm_after_bias, t_encoder_layer_0_layernorm_after_weight, t_encoder_layer_0_layernorm_before_bias, t_encoder_layer_0_layernorm_before_weight, t_encoder_layer_0_output_dense_bias, t_encoder_layer_0_output_dense_weight, t_encoder_layer_1_attention_attention_key_bias, t_encoder_layer_1_attention_attention_key_weight, t_encoder_layer_1_attention_attention_query_bias, t_encoder_layer_1_attention_attention_query_weight, t_encoder_layer_1_attention_attention_value_bias, t_encoder_layer_1_attention_attention_value_weight, t_encoder_layer_1_attention_output_dense_bias, t_encoder_layer_1_attention_output_dense_weight, t_encoder_layer_1_intermediate_dense_bias, t_encoder_layer_1_intermediate_dense_weight, t_encoder_layer_1_layernorm_after_bias, t_encoder_layer_1_layernorm_after_weight, t_encoder_layer_1_layernorm_before_bias, t_encoder_layer_1_layernorm_before_weight, t_encoder_layer_1_output_dense_bias, t_encoder_layer_1_output_dense_weight, t_encoder_layer_2_attention_attention_key_bias, t_encoder_layer_2_attention_attention_key_weight, t_encoder_layer_2_attention_attention_query_bias, t_encoder_layer_2_attention_attention_query_weight, t_encoder_layer_2_attention_attention_value_bias, t_encoder_layer_2_attention_attention_value_weight, t_encoder_layer_2_attention_output_dense_bias, t_encoder_layer_2_attention_output_dense_weight, t_encoder_layer_2_intermediate_dense_bias, t_encoder_layer_2_intermediate_dense_weight, t_encoder_layer_2_layernorm_after_bias, t_encoder_layer_2_layernorm_after_weight, t_encoder_layer_2_layernorm_before_bias, t_encoder_layer_2_layernorm_before_weight, t_encoder_layer_2_output_dense_bias, t_encoder_layer_2_output_dense_weight, t_encoder_layer_3_attention_attention_key_bias, t_encoder_layer_3_attention_attention_key_weight, t_encoder_layer_3_attention_attention_query_bias, t_encoder_layer_3_attention_attention_query_weight, t_encoder_layer_3_attention_attention_value_bias, t_encoder_layer_3_attention_attention_value_weight, t_encoder_layer_3_attention_output_dense_bias, t_encoder_layer_3_attention_output_dense_weight, t_encoder_layer_3_intermediate_dense_bias, t_encoder_layer_3_intermediate_dense_weight, t_encoder_layer_3_layernorm_after_bias, t_encoder_layer_3_layernorm_after_weight, t_encoder_layer_3_layernorm_before_bias, t_encoder_layer_3_layernorm_before_weight, t_encoder_layer_3_output_dense_bias, t_encoder_layer_3_output_dense_weight, t_encoder_layer_4_attention_attention_key_bias, t_encoder_layer_4_attention_attention_key_weight, t_encoder_layer_4_attention_attention_query_bias, t_encoder_layer_4_attention_attention_query_weight, t_encoder_layer_4_attention_attention_value_bias, t_encoder_layer_4_attention_attention_value_weight, t_encoder_layer_4_attention_output_dense_bias, t_encoder_layer_4_attention_output_dense_weight, t_encoder_layer_4_intermediate_dense_bias, t_encoder_layer_4_intermediate_dense_weight, t_encoder_layer_4_layernorm_after_bias, 
t_encoder_layer_4_layernorm_after_weight, t_encoder_layer_4_layernorm_before_bias, t_encoder_layer_4_layernorm_before_weight, t_encoder_layer_4_output_dense_bias, t_encoder_layer_4_output_dense_weight, t_encoder_layer_5_attention_attention_key_bias, t_encoder_layer_5_attention_attention_key_weight, t_encoder_layer_5_attention_attention_query_bias, t_encoder_layer_5_attention_attention_query_weight, t_encoder_layer_5_attention_attention_value_bias, t_encoder_layer_5_attention_attention_value_weight, t_encoder_layer_5_attention_output_dense_bias, t_encoder_layer_5_attention_output_dense_weight, t_encoder_layer_5_intermediate_dense_bias, t_encoder_layer_5_intermediate_dense_weight, t_encoder_layer_5_layernorm_after_bias, t_encoder_layer_5_layernorm_after_weight, t_encoder_layer_5_layernorm_before_bias, t_encoder_layer_5_layernorm_before_weight, t_encoder_layer_5_output_dense_bias, t_encoder_layer_5_output_dense_weight, t_encoder_layer_6_attention_attention_key_bias, t_encoder_layer_6_attention_attention_key_weight, t_encoder_layer_6_attention_attention_query_bias, t_encoder_layer_6_attention_attention_query_weight, t_encoder_layer_6_attention_attention_value_bias, t_encoder_layer_6_attention_attention_value_weight, t_encoder_layer_6_attention_output_dense_bias, t_encoder_layer_6_attention_output_dense_weight, t_encoder_layer_6_intermediate_dense_bias, t_encoder_layer_6_intermediate_dense_weight, t_encoder_layer_6_layernorm_after_bias, t_encoder_layer_6_layernorm_after_weight, t_encoder_layer_6_layernorm_before_bias, t_encoder_layer_6_layernorm_before_weight, t_encoder_layer_6_output_dense_bias, t_encoder_layer_6_output_dense_weight, t_encoder_layer_7_attention_attention_key_bias, t_encoder_layer_7_attention_attention_key_weight, t_encoder_layer_7_attention_attention_query_bias, t_encoder_layer_7_attention_attention_query_weight, t_encoder_layer_7_attention_attention_value_bias, t_encoder_layer_7_attention_attention_value_weight, t_encoder_layer_7_attention_output_dense_bias, t_encoder_layer_7_attention_output_dense_weight, t_encoder_layer_7_intermediate_dense_bias, t_encoder_layer_7_intermediate_dense_weight, t_encoder_layer_7_layernorm_after_bias, t_encoder_layer_7_layernorm_after_weight, t_encoder_layer_7_layernorm_before_bias, t_encoder_layer_7_layernorm_before_weight, t_encoder_layer_7_output_dense_bias, t_encoder_layer_7_output_dense_weight, t_encoder_layer_8_attention_attention_key_bias, t_encoder_layer_8_attention_attention_key_weight, t_encoder_layer_8_attention_attention_query_bias, t_encoder_layer_8_attention_attention_query_weight, t_encoder_layer_8_attention_attention_value_bias, t_encoder_layer_8_attention_attention_value_weight, t_encoder_layer_8_attention_output_dense_bias, t_encoder_layer_8_attention_output_dense_weight, t_encoder_layer_8_intermediate_dense_bias, t_encoder_layer_8_intermediate_dense_weight, t_encoder_layer_8_layernorm_after_bias, t_encoder_layer_8_layernorm_after_weight, t_encoder_layer_8_layernorm_before_bias, t_encoder_layer_8_layernorm_before_weight, t_encoder_layer_8_output_dense_bias, t_encoder_layer_8_output_dense_weight, t_encoder_layer_9_attention_attention_key_bias, t_encoder_layer_9_attention_attention_key_weight, t_encoder_layer_9_attention_attention_query_bias, t_encoder_layer_9_attention_attention_query_weight, t_encoder_layer_9_attention_attention_value_bias, t_encoder_layer_9_attention_attention_value_weight, t_encoder_layer_9_attention_output_dense_bias, t_encoder_layer_9_attention_output_dense_weight, t_encoder_layer_9_intermediate_dense_bias, 
t_encoder_layer_9_intermediate_dense_weight, t_encoder_layer_9_layernorm_after_bias, t_encoder_layer_9_layernorm_after_weight, t_encoder_layer_9_layernorm_before_bias, t_encoder_layer_9_layernorm_before_weight, t_encoder_layer_9_output_dense_bias, t_encoder_layer_9_output_dense_weight, t_encoder_layer_10_attention_attention_key_bias, t_encoder_layer_10_attention_attention_key_weight, t_encoder_layer_10_attention_attention_query_bias, t_encoder_layer_10_attention_attention_query_weight, t_encoder_layer_10_attention_attention_value_bias, t_encoder_layer_10_attention_attention_value_weight, t_encoder_layer_10_attention_output_dense_bias, t_encoder_layer_10_attention_output_dense_weight, t_encoder_layer_10_intermediate_dense_bias, t_encoder_layer_10_intermediate_dense_weight, t_encoder_layer_10_layernorm_after_bias, t_encoder_layer_10_layernorm_after_weight, t_encoder_layer_10_layernorm_before_bias, t_encoder_layer_10_layernorm_before_weight, t_encoder_layer_10_output_dense_bias, t_encoder_layer_10_output_dense_weight, t_encoder_layer_11_attention_attention_key_bias, t_encoder_layer_11_attention_attention_key_weight, t_encoder_layer_11_attention_attention_query_bias, t_encoder_layer_11_attention_attention_query_weight, t_encoder_layer_11_attention_attention_value_bias, t_encoder_layer_11_attention_attention_value_weight, t_encoder_layer_11_attention_output_dense_bias, t_encoder_layer_11_attention_output_dense_weight, t_encoder_layer_11_intermediate_dense_bias, t_encoder_layer_11_intermediate_dense_weight, t_encoder_layer_11_layernorm_after_bias, t_encoder_layer_11_layernorm_after_weight, t_encoder_layer_11_layernorm_before_bias, t_encoder_layer_11_layernorm_before_weight, t_encoder_layer_11_output_dense_bias, t_encoder_layer_11_output_dense_weight, t_layernorm_bias, t_layernorm_weight, t_pooler_dense_bias, t_pooler_dense_weight):
  # pixel_values: "cuda:0 f32[10, 3, 224, 224]"
  # t_embeddings_cls_token: "cuda:0 f32[1, 1, 192]"
  # bias: "cuda:0 f32[192]"
  # weight: "cuda:0 f32[192, 3, 16, 16]"
  # t_embeddings_position_embeddings: "cuda:0 f32[1, 197, 192]"
  # t_encoder_layer_0_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_0_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_0_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_0_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_0_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_0_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_0_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_0_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_0_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_0_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_0_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_0_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_0_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_0_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_0_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_0_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_1_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_1_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_1_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_1_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_1_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_1_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_1_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_1_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_1_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_1_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_1_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_1_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_1_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_1_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_1_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_1_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_2_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_2_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_2_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_2_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_2_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_2_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_2_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_2_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_2_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_2_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_2_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_2_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_2_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_2_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_2_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_2_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_3_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_3_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_3_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_3_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_3_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_3_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_3_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_3_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_3_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_3_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_3_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_3_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_3_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_3_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_3_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_3_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_4_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_4_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_4_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_4_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_4_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_4_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_4_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_4_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_4_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_4_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_4_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_4_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_4_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_4_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_4_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_4_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_5_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_5_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_5_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_5_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_5_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_5_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_5_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_5_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_5_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_5_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_5_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_5_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_5_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_5_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_5_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_5_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_6_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_6_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_6_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_6_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_6_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_6_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_6_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_6_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_6_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_6_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_6_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_6_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_6_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_6_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_6_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_6_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_7_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_7_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_7_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_7_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_7_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_7_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_7_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_7_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_7_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_7_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_7_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_7_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_7_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_7_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_7_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_7_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_8_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_8_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_8_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_8_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_8_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_8_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_8_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_8_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_8_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_8_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_8_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_8_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_8_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_8_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_8_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_8_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_9_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_9_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_9_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_9_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_9_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_9_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_9_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_9_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_9_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_9_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_9_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_9_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_9_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_9_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_9_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_9_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_10_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_10_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_10_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_10_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_10_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_10_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_10_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_10_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_10_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_10_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_10_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_10_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_10_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_10_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_10_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_10_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_encoder_layer_11_attention_attention_key_bias: "cuda:0 f32[192]"
  # t_encoder_layer_11_attention_attention_key_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_11_attention_attention_query_bias: "cuda:0 f32[192]"
  # t_encoder_layer_11_attention_attention_query_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_11_attention_attention_value_bias: "cuda:0 f32[192]"
  # t_encoder_layer_11_attention_attention_value_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_11_attention_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_11_attention_output_dense_weight: "cuda:0 f32[192, 192]"
  # t_encoder_layer_11_intermediate_dense_bias: "cuda:0 f32[768]"
  # t_encoder_layer_11_intermediate_dense_weight: "cuda:0 f32[768, 192]"
  # t_encoder_layer_11_layernorm_after_bias: "cuda:0 f32[192]"
  # t_encoder_layer_11_layernorm_after_weight: "cuda:0 f32[192]"
  # t_encoder_layer_11_layernorm_before_bias: "cuda:0 f32[192]"
  # t_encoder_layer_11_layernorm_before_weight: "cuda:0 f32[192]"
  # t_encoder_layer_11_output_dense_bias: "cuda:0 f32[192]"
  # t_encoder_layer_11_output_dense_weight: "cuda:0 f32[192, 768]"
  # t_layernorm_bias: "cuda:0 f32[192]"
  # t_layernorm_weight: "cuda:0 f32[192]"
  # t_pooler_dense_bias: "cuda:0 f32[192]"
  # t_pooler_dense_weight: "cuda:0 f32[192, 192]"
  t33 = torch.convolution(pixel_values, weight, bias, (16, 16), (0, 0), (1, 1), False, (0, 0), 1)  # t33: "cuda:0 f32[10, 192, 14, 14]"
  [input] = TorchCompile0(t33, t_embeddings_cls_token, t_embeddings_position_embeddings)
  del t33
  [t1675, t1679, hidden_states] = nvFusion0(input, t_encoder_layer_0_layernorm_before_weight, t_encoder_layer_0_layernorm_before_bias)
  mixed_query_layer = torch.nn.functional.linear(hidden_states, t_encoder_layer_0_attention_attention_query_weight, t_encoder_layer_0_attention_attention_query_bias)  # mixed_query_layer: "cuda:0 f32[10, 197, 192]"
  x = torch.nn.functional.linear(hidden_states, t_encoder_layer_0_attention_attention_key_weight, t_encoder_layer_0_attention_attention_key_bias)  # x: "cuda:0 f32[10, 197, 192]"
  a = torch.nn.functional.linear(hidden_states, t_encoder_layer_0_attention_attention_value_weight, t_encoder_layer_0_attention_attention_value_bias)  # a: "cuda:0 f32[10, 197, 192]"
  [value_layer, query_layer, t103] = nvFusion1(x, a, mixed_query_layer)
  del x, a, mixed_query_layer
  attention_scores = torch.matmul(query_layer, t103)  # attention_scores: "cuda:0 f32[10, 3, 197, 197]"
  [attention_probs] = nvFusion2(attention_scores)
  del attention_scores
  context_layer = torch.matmul(attention_probs, value_layer)  # context_layer: "cuda:0 f32[10, 3, 197, 64]"
  [t122] = nvFusion3(context_layer)
  del context_layer
  attention_output = torch.nn.functional.linear(t122, t_encoder_layer_0_attention_output_dense_weight, t_encoder_layer_0_attention_output_dense_bias)  # attention_output: "cuda:0 f32[10, 197, 192]"
  [input_tensor, t1736, t1741, layer_output] = nvFusion4(attention_output, input, t_encoder_layer_0_layernorm_after_weight, t_encoder_layer_0_layernorm_after_bias)
  del attention_output
  t162 = torch.nn.functional.linear(layer_output, t_encoder_layer_0_intermediate_dense_weight, t_encoder_layer_0_intermediate_dense_bias)  # t162: "cuda:0 f32[10, 197, 768]"
  [t167] = nvFusion5(t162)
  t174 = torch.nn.functional.linear(t167, t_encoder_layer_0_output_dense_weight, t_encoder_layer_0_output_dense_bias)  # t174: "cuda:0 f32[10, 197, 192]"
  [t178, t1763, t1768, t204] = nvFusion6(t174, input_tensor, t_encoder_layer_1_layernorm_before_weight, t_encoder_layer_1_layernorm_before_bias)
  del t174
  t216 = torch.nn.functional.linear(t204, t_encoder_layer_1_attention_attention_query_weight, t_encoder_layer_1_attention_attention_query_bias)  # t216: "cuda:0 f32[10, 197, 192]"
  t221 = torch.nn.functional.linear(t204, t_encoder_layer_1_attention_attention_key_weight, t_encoder_layer_1_attention_attention_key_bias)  # t221: "cuda:0 f32[10, 197, 192]"
  t230 = torch.nn.functional.linear(t204, t_encoder_layer_1_attention_attention_value_weight, t_encoder_layer_1_attention_attention_value_bias)  # t230: "cuda:0 f32[10, 197, 192]"
  [t232, t234, t235] = nvFusion7(t221, t230, t216)
  del t221, t230, t216
  t236 = torch.matmul(t234, t235)  # t236: "cuda:0 f32[10, 3, 197, 197]"
  [t246] = nvFusion8(t236)
  del t236
  t250 = torch.matmul(t246, t232)  # t250: "cuda:0 f32[10, 3, 197, 64]"
  [t254] = nvFusion9(t250)
  del t250
  t261 = torch.nn.functional.linear(t254, t_encoder_layer_1_attention_output_dense_weight, t_encoder_layer_1_attention_output_dense_bias)  # t261: "cuda:0 f32[10, 197, 192]"
  [t265, t1843, t1848, t287] = nvFusion10(t261, t178, t_encoder_layer_1_layernorm_after_weight, t_encoder_layer_1_layernorm_after_bias)
  del t261
  t294 = torch.nn.functional.linear(t287, t_encoder_layer_1_intermediate_dense_weight, t_encoder_layer_1_intermediate_dense_bias)  # t294: "cuda:0 f32[10, 197, 768]"
  [t299] = nvFusion11(t294)
  t306 = torch.nn.functional.linear(t299, t_encoder_layer_1_output_dense_weight, t_encoder_layer_1_output_dense_bias)  # t306: "cuda:0 f32[10, 197, 192]"
  [t310, t1873, t1878, t336] = nvFusion12(t306, t265, t_encoder_layer_2_layernorm_before_weight, t_encoder_layer_2_layernorm_before_bias)
  del t306
  t348 = torch.nn.functional.linear(t336, t_encoder_layer_2_attention_attention_query_weight, t_encoder_layer_2_attention_attention_query_bias)  # t348: "cuda:0 f32[10, 197, 192]"
  t353 = torch.nn.functional.linear(t336, t_encoder_layer_2_attention_attention_key_weight, t_encoder_layer_2_attention_attention_key_bias)  # t353: "cuda:0 f32[10, 197, 192]"
  t362 = torch.nn.functional.linear(t336, t_encoder_layer_2_attention_attention_value_weight, t_encoder_layer_2_attention_attention_value_bias)  # t362: "cuda:0 f32[10, 197, 192]"
  [t364, t366, t367] = nvFusion13(t353, t362, t348)
  del t353, t362, t348
  t368 = torch.matmul(t366, t367)  # t368: "cuda:0 f32[10, 3, 197, 197]"
  [t378] = nvFusion14(t368)
  del t368
  t382 = torch.matmul(t378, t364)  # t382: "cuda:0 f32[10, 3, 197, 64]"
  [t386] = nvFusion15(t382)
  del t382
  t393 = torch.nn.functional.linear(t386, t_encoder_layer_2_attention_output_dense_weight, t_encoder_layer_2_attention_output_dense_bias)  # t393: "cuda:0 f32[10, 197, 192]"
  [t397, t1953, t1958, t419] = nvFusion16(t393, t310, t_encoder_layer_2_layernorm_after_weight, t_encoder_layer_2_layernorm_after_bias)
  del t393
  t426 = torch.nn.functional.linear(t419, t_encoder_layer_2_intermediate_dense_weight, t_encoder_layer_2_intermediate_dense_bias)  # t426: "cuda:0 f32[10, 197, 768]"
  [t431] = nvFusion17(t426)
  t438 = torch.nn.functional.linear(t431, t_encoder_layer_2_output_dense_weight, t_encoder_layer_2_output_dense_bias)  # t438: "cuda:0 f32[10, 197, 192]"
  [t442, t1983, t1988, t468] = nvFusion18(t438, t397, t_encoder_layer_3_layernorm_before_weight, t_encoder_layer_3_layernorm_before_bias)
  del t438
  t480 = torch.nn.functional.linear(t468, t_encoder_layer_3_attention_attention_query_weight, t_encoder_layer_3_attention_attention_query_bias)  # t480: "cuda:0 f32[10, 197, 192]"
  t485 = torch.nn.functional.linear(t468, t_encoder_layer_3_attention_attention_key_weight, t_encoder_layer_3_attention_attention_key_bias)  # t485: "cuda:0 f32[10, 197, 192]"
  t494 = torch.nn.functional.linear(t468, t_encoder_layer_3_attention_attention_value_weight, t_encoder_layer_3_attention_attention_value_bias)  # t494: "cuda:0 f32[10, 197, 192]"
  [t496, t498, t499] = nvFusion19(t485, t494, t480)
  del t485, t494, t480
  t500 = torch.matmul(t498, t499)  # t500: "cuda:0 f32[10, 3, 197, 197]"
  [t510] = nvFusion20(t500)
  del t500
  t514 = torch.matmul(t510, t496)  # t514: "cuda:0 f32[10, 3, 197, 64]"
  [t518] = nvFusion21(t514)
  del t514
  t525 = torch.nn.functional.linear(t518, t_encoder_layer_3_attention_output_dense_weight, t_encoder_layer_3_attention_output_dense_bias)  # t525: "cuda:0 f32[10, 197, 192]"
  [t529, t2063, t2068, t551] = nvFusion22(t525, t442, t_encoder_layer_3_layernorm_after_weight, t_encoder_layer_3_layernorm_after_bias)
  del t525
  t558 = torch.nn.functional.linear(t551, t_encoder_layer_3_intermediate_dense_weight, t_encoder_layer_3_intermediate_dense_bias)  # t558: "cuda:0 f32[10, 197, 768]"
  [t563] = nvFusion23(t558)
  t570 = torch.nn.functional.linear(t563, t_encoder_layer_3_output_dense_weight, t_encoder_layer_3_output_dense_bias)  # t570: "cuda:0 f32[10, 197, 192]"
  [t574, t2093, t2098, t600] = nvFusion24(t570, t529, t_encoder_layer_4_layernorm_before_weight, t_encoder_layer_4_layernorm_before_bias)
  del t570
  t612 = torch.nn.functional.linear(t600, t_encoder_layer_4_attention_attention_query_weight, t_encoder_layer_4_attention_attention_query_bias)  # t612: "cuda:0 f32[10, 197, 192]"
  t617 = torch.nn.functional.linear(t600, t_encoder_layer_4_attention_attention_key_weight, t_encoder_layer_4_attention_attention_key_bias)  # t617: "cuda:0 f32[10, 197, 192]"
  t626 = torch.nn.functional.linear(t600, t_encoder_layer_4_attention_attention_value_weight, t_encoder_layer_4_attention_attention_value_bias)  # t626: "cuda:0 f32[10, 197, 192]"
  [t628, t630, t631] = nvFusion25(t617, t626, t612)
  del t617, t626, t612
  t632 = torch.matmul(t630, t631)  # t632: "cuda:0 f32[10, 3, 197, 197]"
  [t642] = nvFusion26(t632)
  del t632
  t646 = torch.matmul(t642, t628)  # t646: "cuda:0 f32[10, 3, 197, 64]"
  [t650] = nvFusion27(t646)
  del t646
  t657 = torch.nn.functional.linear(t650, t_encoder_layer_4_attention_output_dense_weight, t_encoder_layer_4_attention_output_dense_bias)  # t657: "cuda:0 f32[10, 197, 192]"
  [t661, t2173, t2178, t683] = nvFusion28(t657, t574, t_encoder_layer_4_layernorm_after_weight, t_encoder_layer_4_layernorm_after_bias)
  del t657
  t690 = torch.nn.functional.linear(t683, t_encoder_layer_4_intermediate_dense_weight, t_encoder_layer_4_intermediate_dense_bias)  # t690: "cuda:0 f32[10, 197, 768]"
  [t695] = nvFusion29(t690)
  t702 = torch.nn.functional.linear(t695, t_encoder_layer_4_output_dense_weight, t_encoder_layer_4_output_dense_bias)  # t702: "cuda:0 f32[10, 197, 192]"
  [t706, t2203, t2208, t732] = nvFusion30(t702, t661, t_encoder_layer_5_layernorm_before_weight, t_encoder_layer_5_layernorm_before_bias)
  del t702
  t744 = torch.nn.functional.linear(t732, t_encoder_layer_5_attention_attention_query_weight, t_encoder_layer_5_attention_attention_query_bias)  # t744: "cuda:0 f32[10, 197, 192]"
  t749 = torch.nn.functional.linear(t732, t_encoder_layer_5_attention_attention_key_weight, t_encoder_layer_5_attention_attention_key_bias)  # t749: "cuda:0 f32[10, 197, 192]"
  t758 = torch.nn.functional.linear(t732, t_encoder_layer_5_attention_attention_value_weight, t_encoder_layer_5_attention_attention_value_bias)  # t758: "cuda:0 f32[10, 197, 192]"
  [t760, t762, t763] = nvFusion31(t749, t758, t744)
  del t749, t758, t744
  t764 = torch.matmul(t762, t763)  # t764: "cuda:0 f32[10, 3, 197, 197]"
  [t774] = nvFusion32(t764)
  del t764
  t778 = torch.matmul(t774, t760)  # t778: "cuda:0 f32[10, 3, 197, 64]"
  [t782] = nvFusion33(t778)
  del t778
  t789 = torch.nn.functional.linear(t782, t_encoder_layer_5_attention_output_dense_weight, t_encoder_layer_5_attention_output_dense_bias)  # t789: "cuda:0 f32[10, 197, 192]"
  [t793, t2283, t2288, t815] = nvFusion34(t789, t706, t_encoder_layer_5_layernorm_after_weight, t_encoder_layer_5_layernorm_after_bias)
  del t789
  t822 = torch.nn.functional.linear(t815, t_encoder_layer_5_intermediate_dense_weight, t_encoder_layer_5_intermediate_dense_bias)  # t822: "cuda:0 f32[10, 197, 768]"
  [t827] = nvFusion35(t822)
  t834 = torch.nn.functional.linear(t827, t_encoder_layer_5_output_dense_weight, t_encoder_layer_5_output_dense_bias)  # t834: "cuda:0 f32[10, 197, 192]"
  [t838, t2313, t2318, t864] = nvFusion36(t834, t793, t_encoder_layer_6_layernorm_before_weight, t_encoder_layer_6_layernorm_before_bias)
  del t834
  t876 = torch.nn.functional.linear(t864, t_encoder_layer_6_attention_attention_query_weight, t_encoder_layer_6_attention_attention_query_bias)  # t876: "cuda:0 f32[10, 197, 192]"
  t881 = torch.nn.functional.linear(t864, t_encoder_layer_6_attention_attention_key_weight, t_encoder_layer_6_attention_attention_key_bias)  # t881: "cuda:0 f32[10, 197, 192]"
  t890 = torch.nn.functional.linear(t864, t_encoder_layer_6_attention_attention_value_weight, t_encoder_layer_6_attention_attention_value_bias)  # t890: "cuda:0 f32[10, 197, 192]"
  [t892, t894, t895] = nvFusion37(t881, t890, t876)
  del t881, t890, t876
  t896 = torch.matmul(t894, t895)  # t896: "cuda:0 f32[10, 3, 197, 197]"
  [t906] = nvFusion38(t896)
  del t896
  t910 = torch.matmul(t906, t892)  # t910: "cuda:0 f32[10, 3, 197, 64]"
  [t914] = nvFusion39(t910)
  del t910
  t921 = torch.nn.functional.linear(t914, t_encoder_layer_6_attention_output_dense_weight, t_encoder_layer_6_attention_output_dense_bias)  # t921: "cuda:0 f32[10, 197, 192]"
  [t925, t2393, t2398, t947] = nvFusion40(t921, t838, t_encoder_layer_6_layernorm_after_weight, t_encoder_layer_6_layernorm_after_bias)
  del t921
  t954 = torch.nn.functional.linear(t947, t_encoder_layer_6_intermediate_dense_weight, t_encoder_layer_6_intermediate_dense_bias)  # t954: "cuda:0 f32[10, 197, 768]"
  [t959] = nvFusion41(t954)
  t966 = torch.nn.functional.linear(t959, t_encoder_layer_6_output_dense_weight, t_encoder_layer_6_output_dense_bias)  # t966: "cuda:0 f32[10, 197, 192]"
  [t970, t2423, t2428, t996] = nvFusion42(t966, t925, t_encoder_layer_7_layernorm_before_weight, t_encoder_layer_7_layernorm_before_bias)
  del t966
  t1008 = torch.nn.functional.linear(t996, t_encoder_layer_7_attention_attention_query_weight, t_encoder_layer_7_attention_attention_query_bias)  # t1008: "cuda:0 f32[10, 197, 192]"
  t1013 = torch.nn.functional.linear(t996, t_encoder_layer_7_attention_attention_key_weight, t_encoder_layer_7_attention_attention_key_bias)  # t1013: "cuda:0 f32[10, 197, 192]"
  t1022 = torch.nn.functional.linear(t996, t_encoder_layer_7_attention_attention_value_weight, t_encoder_layer_7_attention_attention_value_bias)  # t1022: "cuda:0 f32[10, 197, 192]"
  [t1024, t1026, t1027] = nvFusion43(t1013, t1022, t1008)
  del t1013, t1022, t1008
  t1028 = torch.matmul(t1026, t1027)  # t1028: "cuda:0 f32[10, 3, 197, 197]"
  [t1038] = nvFusion44(t1028)
  del t1028
  t1042 = torch.matmul(t1038, t1024)  # t1042: "cuda:0 f32[10, 3, 197, 64]"
  [t1046] = nvFusion45(t1042)
  del t1042
  t1053 = torch.nn.functional.linear(t1046, t_encoder_layer_7_attention_output_dense_weight, t_encoder_layer_7_attention_output_dense_bias)  # t1053: "cuda:0 f32[10, 197, 192]"
  [t1057, t2503, t2508, t1079] = nvFusion46(t1053, t970, t_encoder_layer_7_layernorm_after_weight, t_encoder_layer_7_layernorm_after_bias)
  del t1053
  t1086 = torch.nn.functional.linear(t1079, t_encoder_layer_7_intermediate_dense_weight, t_encoder_layer_7_intermediate_dense_bias)  # t1086: "cuda:0 f32[10, 197, 768]"
  [t1091] = nvFusion47(t1086)
  t1098 = torch.nn.functional.linear(t1091, t_encoder_layer_7_output_dense_weight, t_encoder_layer_7_output_dense_bias)  # t1098: "cuda:0 f32[10, 197, 192]"
  [t1102, t2533, t2538, t1128] = nvFusion48(t1098, t1057, t_encoder_layer_8_layernorm_before_weight, t_encoder_layer_8_layernorm_before_bias)
  del t1098
  t1140 = torch.nn.functional.linear(t1128, t_encoder_layer_8_attention_attention_query_weight, t_encoder_layer_8_attention_attention_query_bias)  # t1140: "cuda:0 f32[10, 197, 192]"
  t1145 = torch.nn.functional.linear(t1128, t_encoder_layer_8_attention_attention_key_weight, t_encoder_layer_8_attention_attention_key_bias)  # t1145: "cuda:0 f32[10, 197, 192]"
  t1154 = torch.nn.functional.linear(t1128, t_encoder_layer_8_attention_attention_value_weight, t_encoder_layer_8_attention_attention_value_bias)  # t1154: "cuda:0 f32[10, 197, 192]"
  [t1156, t1158, t1159] = nvFusion49(t1145, t1154, t1140)
  del t1145, t1154, t1140
  t1160 = torch.matmul(t1158, t1159)  # t1160: "cuda:0 f32[10, 3, 197, 197]"
  [t1170] = nvFusion50(t1160)
  del t1160
  t1174 = torch.matmul(t1170, t1156)  # t1174: "cuda:0 f32[10, 3, 197, 64]"
  [t1178] = nvFusion51(t1174)
  del t1174
  t1185 = torch.nn.functional.linear(t1178, t_encoder_layer_8_attention_output_dense_weight, t_encoder_layer_8_attention_output_dense_bias)  # t1185: "cuda:0 f32[10, 197, 192]"
  [t1189, t2613, t2618, t1211] = nvFusion52(t1185, t1102, t_encoder_layer_8_layernorm_after_weight, t_encoder_layer_8_layernorm_after_bias)
  del t1185
  t1218 = torch.nn.functional.linear(t1211, t_encoder_layer_8_intermediate_dense_weight, t_encoder_layer_8_intermediate_dense_bias)  # t1218: "cuda:0 f32[10, 197, 768]"
  [t1223] = nvFusion53(t1218)
  t1230 = torch.nn.functional.linear(t1223, t_encoder_layer_8_output_dense_weight, t_encoder_layer_8_output_dense_bias)  # t1230: "cuda:0 f32[10, 197, 192]"
  [t1234, t2643, t2648, t1260] = nvFusion54(t1230, t1189, t_encoder_layer_9_layernorm_before_weight, t_encoder_layer_9_layernorm_before_bias)
  del t1230
  t1272 = torch.nn.functional.linear(t1260, t_encoder_layer_9_attention_attention_query_weight, t_encoder_layer_9_attention_attention_query_bias)  # t1272: "cuda:0 f32[10, 197, 192]"
  t1277 = torch.nn.functional.linear(t1260, t_encoder_layer_9_attention_attention_key_weight, t_encoder_layer_9_attention_attention_key_bias)  # t1277: "cuda:0 f32[10, 197, 192]"
  t1286 = torch.nn.functional.linear(t1260, t_encoder_layer_9_attention_attention_value_weight, t_encoder_layer_9_attention_attention_value_bias)  # t1286: "cuda:0 f32[10, 197, 192]"
  [t1288, t1290, t1291] = nvFusion55(t1277, t1286, t1272)
  del t1277, t1286, t1272
  t1292 = torch.matmul(t1290, t1291)  # t1292: "cuda:0 f32[10, 3, 197, 197]"
  [t1302] = nvFusion56(t1292)
  del t1292
  t1306 = torch.matmul(t1302, t1288)  # t1306: "cuda:0 f32[10, 3, 197, 64]"
  [t1310] = nvFusion57(t1306)
  del t1306
  t1317 = torch.nn.functional.linear(t1310, t_encoder_layer_9_attention_output_dense_weight, t_encoder_layer_9_attention_output_dense_bias)  # t1317: "cuda:0 f32[10, 197, 192]"
  [t1321, t2723, t2728, t1343] = nvFusion58(t1317, t1234, t_encoder_layer_9_layernorm_after_weight, t_encoder_layer_9_layernorm_after_bias)
  del t1317
  t1350 = torch.nn.functional.linear(t1343, t_encoder_layer_9_intermediate_dense_weight, t_encoder_layer_9_intermediate_dense_bias)  # t1350: "cuda:0 f32[10, 197, 768]"
  [t1355] = nvFusion59(t1350)
  t1362 = torch.nn.functional.linear(t1355, t_encoder_layer_9_output_dense_weight, t_encoder_layer_9_output_dense_bias)  # t1362: "cuda:0 f32[10, 197, 192]"
  [t1366, t2753, t2758, t1392] = nvFusion60(t1362, t1321, t_encoder_layer_10_layernorm_before_weight, t_encoder_layer_10_layernorm_before_bias)
  del t1362
  t1404 = torch.nn.functional.linear(t1392, t_encoder_layer_10_attention_attention_query_weight, t_encoder_layer_10_attention_attention_query_bias)  # t1404: "cuda:0 f32[10, 197, 192]"
  t1409 = torch.nn.functional.linear(t1392, t_encoder_layer_10_attention_attention_key_weight, t_encoder_layer_10_attention_attention_key_bias)  # t1409: "cuda:0 f32[10, 197, 192]"
  t1418 = torch.nn.functional.linear(t1392, t_encoder_layer_10_attention_attention_value_weight, t_encoder_layer_10_attention_attention_value_bias)  # t1418: "cuda:0 f32[10, 197, 192]"
  [t1420, t1422, t1423] = nvFusion61(t1409, t1418, t1404)
  del t1409, t1418, t1404
  t1424 = torch.matmul(t1422, t1423)  # t1424: "cuda:0 f32[10, 3, 197, 197]"
  [t1434] = nvFusion62(t1424)
  del t1424
  t1438 = torch.matmul(t1434, t1420)  # t1438: "cuda:0 f32[10, 3, 197, 64]"
  [t1442] = nvFusion63(t1438)
  del t1438
  t1449 = torch.nn.functional.linear(t1442, t_encoder_layer_10_attention_output_dense_weight, t_encoder_layer_10_attention_output_dense_bias)  # t1449: "cuda:0 f32[10, 197, 192]"
  [t1453, t2833, t2838, t1475] = nvFusion64(t1449, t1366, t_encoder_layer_10_layernorm_after_weight, t_encoder_layer_10_layernorm_after_bias)
  del t1449
  t1482 = torch.nn.functional.linear(t1475, t_encoder_layer_10_intermediate_dense_weight, t_encoder_layer_10_intermediate_dense_bias)  # t1482: "cuda:0 f32[10, 197, 768]"
  [t1487] = nvFusion65(t1482)
  t1494 = torch.nn.functional.linear(t1487, t_encoder_layer_10_output_dense_weight, t_encoder_layer_10_output_dense_bias)  # t1494: "cuda:0 f32[10, 197, 192]"
  [t1498, t2863, t2868, t1524] = nvFusion66(t1494, t1453, t_encoder_layer_11_layernorm_before_weight, t_encoder_layer_11_layernorm_before_bias)
  del t1494
  t1536 = torch.nn.functional.linear(t1524, t_encoder_layer_11_attention_attention_query_weight, t_encoder_layer_11_attention_attention_query_bias)  # t1536: "cuda:0 f32[10, 197, 192]"
  t1541 = torch.nn.functional.linear(t1524, t_encoder_layer_11_attention_attention_key_weight, t_encoder_layer_11_attention_attention_key_bias)  # t1541: "cuda:0 f32[10, 197, 192]"
  t1550 = torch.nn.functional.linear(t1524, t_encoder_layer_11_attention_attention_value_weight, t_encoder_layer_11_attention_attention_value_bias)  # t1550: "cuda:0 f32[10, 197, 192]"
  [t1552, t1554, t1555] = nvFusion67(t1541, t1550, t1536)
  del t1541, t1550, t1536
  t1556 = torch.matmul(t1554, t1555)  # t1556: "cuda:0 f32[10, 3, 197, 197]"
  [t1566] = nvFusion68(t1556)
  del t1556
  t1570 = torch.matmul(t1566, t1552)  # t1570: "cuda:0 f32[10, 3, 197, 64]"
  [t1574] = nvFusion69(t1570)
  del t1570
  t1581 = torch.nn.functional.linear(t1574, t_encoder_layer_11_attention_output_dense_weight, t_encoder_layer_11_attention_output_dense_bias)  # t1581: "cuda:0 f32[10, 197, 192]"
  [t1585, t2943, t2948, t1607] = nvFusion70(t1581, t1498, t_encoder_layer_11_layernorm_after_weight, t_encoder_layer_11_layernorm_after_bias)
  del t1581
  t1614 = torch.nn.functional.linear(t1607, t_encoder_layer_11_intermediate_dense_weight, t_encoder_layer_11_intermediate_dense_bias)  # t1614: "cuda:0 f32[10, 197, 768]"
  [t1619] = nvFusion71(t1614)
  t1626 = torch.nn.functional.linear(t1619, t_encoder_layer_11_output_dense_weight, t_encoder_layer_11_output_dense_bias)  # t1626: "cuda:0 f32[10, 197, 192]"
  [last_hidden_state, t2973, t2978, sequence_output, first_token_tensor] = nvFusion72(t1626, t1585, t_layernorm_weight, t_layernorm_bias)
  del t1626
  pooled_output = torch.nn.functional.linear(first_token_tensor, t_pooler_dense_weight, t_pooler_dense_bias)  # pooled_output: "cuda:0 f32[10, 192]"
  [pooler_output] = nvFusion73(pooled_output)
  del pooled_output
  return {'output': transformers_modeling_outputs_BaseModelOutputWithPooling(last_hidden_state=sequence_output,pooler_output=pooler_output,hidden_states=None,attentions=None), 'flat_args': [pixel_values, t_embeddings_cls_token, bias, weight, t_embeddings_position_embeddings, t_encoder_layer_0_attention_attention_key_bias, t_encoder_layer_0_attention_attention_key_weight, t_encoder_layer_0_attention_attention_query_bias, t_encoder_layer_0_attention_attention_query_weight, t_encoder_layer_0_attention_attention_value_bias, t_encoder_layer_0_attention_attention_value_weight, t_encoder_layer_0_attention_output_dense_bias, t_encoder_layer_0_attention_output_dense_weight, t_encoder_layer_0_intermediate_dense_bias, t_encoder_layer_0_intermediate_dense_weight, t_encoder_layer_0_layernorm_after_bias, t_encoder_layer_0_layernorm_after_weight, t_encoder_layer_0_layernorm_before_bias, t_encoder_layer_0_layernorm_before_weight, t_encoder_layer_0_output_dense_bias, t_encoder_layer_0_output_dense_weight, t_encoder_layer_1_attention_attention_key_bias, t_encoder_layer_1_attention_attention_key_weight, t_encoder_layer_1_attention_attention_query_bias, t_encoder_layer_1_attention_attention_query_weight, t_encoder_layer_1_attention_attention_value_bias, t_encoder_layer_1_attention_attention_value_weight, t_encoder_layer_1_attention_output_dense_bias, t_encoder_layer_1_attention_output_dense_weight, t_encoder_layer_1_intermediate_dense_bias, t_encoder_layer_1_intermediate_dense_weight, t_encoder_layer_1_layernorm_after_bias, t_encoder_layer_1_layernorm_after_weight, t_encoder_layer_1_layernorm_before_bias, t_encoder_layer_1_layernorm_before_weight, t_encoder_layer_1_output_dense_bias, t_encoder_layer_1_output_dense_weight, t_encoder_layer_2_attention_attention_key_bias, t_encoder_layer_2_attention_attention_key_weight, t_encoder_layer_2_attention_attention_query_bias, t_encoder_layer_2_attention_attention_query_weight, t_encoder_layer_2_attention_attention_value_bias, t_encoder_layer_2_attention_attention_value_weight, t_encoder_layer_2_attention_output_dense_bias, t_encoder_layer_2_attention_output_dense_weight, t_encoder_layer_2_intermediate_dense_bias, t_encoder_layer_2_intermediate_dense_weight, t_encoder_layer_2_layernorm_after_bias, t_encoder_layer_2_layernorm_after_weight, t_encoder_layer_2_layernorm_before_bias, t_encoder_layer_2_layernorm_before_weight, t_encoder_layer_2_output_dense_bias, t_encoder_layer_2_output_dense_weight, t_encoder_layer_3_attention_attention_key_bias, t_encoder_layer_3_attention_attention_key_weight, t_encoder_layer_3_attention_attention_query_bias, t_encoder_layer_3_attention_attention_query_weight, t_encoder_layer_3_attention_attention_value_bias, t_encoder_layer_3_attention_attention_value_weight, t_encoder_layer_3_attention_output_dense_bias, t_encoder_layer_3_attention_output_dense_weight, t_encoder_layer_3_intermediate_dense_bias, t_encoder_layer_3_intermediate_dense_weight, t_encoder_layer_3_layernorm_after_bias, t_encoder_layer_3_layernorm_after_weight, t_encoder_layer_3_layernorm_before_bias, t_encoder_layer_3_layernorm_before_weight, t_encoder_layer_3_output_dense_bias, t_encoder_layer_3_output_dense_weight, t_encoder_layer_4_attention_attention_key_bias, t_encoder_layer_4_attention_attention_key_weight, t_encoder_layer_4_attention_attention_query_bias, t_encoder_layer_4_attention_attention_query_weight, t_encoder_layer_4_attention_attention_value_bias, t_encoder_layer_4_attention_attention_value_weight, t_encoder_layer_4_attention_output_dense_bias, 
t_encoder_layer_4_attention_output_dense_weight, t_encoder_layer_4_intermediate_dense_bias, t_encoder_layer_4_intermediate_dense_weight, t_encoder_layer_4_layernorm_after_bias, t_encoder_layer_4_layernorm_after_weight, t_encoder_layer_4_layernorm_before_bias, t_encoder_layer_4_layernorm_before_weight, t_encoder_layer_4_output_dense_bias, t_encoder_layer_4_output_dense_weight, t_encoder_layer_5_attention_attention_key_bias, t_encoder_layer_5_attention_attention_key_weight, t_encoder_layer_5_attention_attention_query_bias, t_encoder_layer_5_attention_attention_query_weight, t_encoder_layer_5_attention_attention_value_bias, t_encoder_layer_5_attention_attention_value_weight, t_encoder_layer_5_attention_output_dense_bias, t_encoder_layer_5_attention_output_dense_weight, t_encoder_layer_5_intermediate_dense_bias, t_encoder_layer_5_intermediate_dense_weight, t_encoder_layer_5_layernorm_after_bias, t_encoder_layer_5_layernorm_after_weight, t_encoder_layer_5_layernorm_before_bias, t_encoder_layer_5_layernorm_before_weight, t_encoder_layer_5_output_dense_bias, t_encoder_layer_5_output_dense_weight, t_encoder_layer_6_attention_attention_key_bias, t_encoder_layer_6_attention_attention_key_weight, t_encoder_layer_6_attention_attention_query_bias, t_encoder_layer_6_attention_attention_query_weight, t_encoder_layer_6_attention_attention_value_bias, t_encoder_layer_6_attention_attention_value_weight, t_encoder_layer_6_attention_output_dense_bias, t_encoder_layer_6_attention_output_dense_weight, t_encoder_layer_6_intermediate_dense_bias, t_encoder_layer_6_intermediate_dense_weight, t_encoder_layer_6_layernorm_after_bias, t_encoder_layer_6_layernorm_after_weight, t_encoder_layer_6_layernorm_before_bias, t_encoder_layer_6_layernorm_before_weight, t_encoder_layer_6_output_dense_bias, t_encoder_layer_6_output_dense_weight, t_encoder_layer_7_attention_attention_key_bias, t_encoder_layer_7_attention_attention_key_weight, t_encoder_layer_7_attention_attention_query_bias, t_encoder_layer_7_attention_attention_query_weight, t_encoder_layer_7_attention_attention_value_bias, t_encoder_layer_7_attention_attention_value_weight, t_encoder_layer_7_attention_output_dense_bias, t_encoder_layer_7_attention_output_dense_weight, t_encoder_layer_7_intermediate_dense_bias, t_encoder_layer_7_intermediate_dense_weight, t_encoder_layer_7_layernorm_after_bias, t_encoder_layer_7_layernorm_after_weight, t_encoder_layer_7_layernorm_before_bias, t_encoder_layer_7_layernorm_before_weight, t_encoder_layer_7_output_dense_bias, t_encoder_layer_7_output_dense_weight, t_encoder_layer_8_attention_attention_key_bias, t_encoder_layer_8_attention_attention_key_weight, t_encoder_layer_8_attention_attention_query_bias, t_encoder_layer_8_attention_attention_query_weight, t_encoder_layer_8_attention_attention_value_bias, t_encoder_layer_8_attention_attention_value_weight, t_encoder_layer_8_attention_output_dense_bias, t_encoder_layer_8_attention_output_dense_weight, t_encoder_layer_8_intermediate_dense_bias, t_encoder_layer_8_intermediate_dense_weight, t_encoder_layer_8_layernorm_after_bias, t_encoder_layer_8_layernorm_after_weight, t_encoder_layer_8_layernorm_before_bias, t_encoder_layer_8_layernorm_before_weight, t_encoder_layer_8_output_dense_bias, t_encoder_layer_8_output_dense_weight, t_encoder_layer_9_attention_attention_key_bias, t_encoder_layer_9_attention_attention_key_weight, t_encoder_layer_9_attention_attention_query_bias, t_encoder_layer_9_attention_attention_query_weight, t_encoder_layer_9_attention_attention_value_bias, 
t_encoder_layer_9_attention_attention_value_weight, t_encoder_layer_9_attention_output_dense_bias, t_encoder_layer_9_attention_output_dense_weight, t_encoder_layer_9_intermediate_dense_bias, t_encoder_layer_9_intermediate_dense_weight, t_encoder_layer_9_layernorm_after_bias, t_encoder_layer_9_layernorm_after_weight, t_encoder_layer_9_layernorm_before_bias, t_encoder_layer_9_layernorm_before_weight, t_encoder_layer_9_output_dense_bias, t_encoder_layer_9_output_dense_weight, t_encoder_layer_10_attention_attention_key_bias, t_encoder_layer_10_attention_attention_key_weight, t_encoder_layer_10_attention_attention_query_bias, t_encoder_layer_10_attention_attention_query_weight, t_encoder_layer_10_attention_attention_value_bias, t_encoder_layer_10_attention_attention_value_weight, t_encoder_layer_10_attention_output_dense_bias, t_encoder_layer_10_attention_output_dense_weight, t_encoder_layer_10_intermediate_dense_bias, t_encoder_layer_10_intermediate_dense_weight, t_encoder_layer_10_layernorm_after_bias, t_encoder_layer_10_layernorm_after_weight, t_encoder_layer_10_layernorm_before_bias, t_encoder_layer_10_layernorm_before_weight, t_encoder_layer_10_output_dense_bias, t_encoder_layer_10_output_dense_weight, t_encoder_layer_11_attention_attention_key_bias, t_encoder_layer_11_attention_attention_key_weight, t_encoder_layer_11_attention_attention_query_bias, t_encoder_layer_11_attention_attention_query_weight, t_encoder_layer_11_attention_attention_value_bias, t_encoder_layer_11_attention_attention_value_weight, t_encoder_layer_11_attention_output_dense_bias, t_encoder_layer_11_attention_output_dense_weight, t_encoder_layer_11_intermediate_dense_bias, t_encoder_layer_11_intermediate_dense_weight, t_encoder_layer_11_layernorm_after_bias, t_encoder_layer_11_layernorm_after_weight, t_encoder_layer_11_layernorm_before_bias, t_encoder_layer_11_layernorm_before_weight, t_encoder_layer_11_output_dense_bias, t_encoder_layer_11_output_dense_weight, t_layernorm_bias, t_layernorm_weight, t_pooler_dense_bias, t_pooler_dense_weight], 'flat_output': (None, None, sequence_output, pooler_output)}, ((attention_probs, first_token_tensor, hidden_states, input, input_tensor, last_hidden_state, layer_output, pixel_values, pooler_output, query_layer, t1024, t1026, t1027, t103, t1038, t1046, t1057, t1079, t1086, t1091, t1102, t1128, t1156, t1158, t1159, t1170, t1178, t1189, t1211, t1218, t122, t1223, t1234, t1260, t1288, t1290, t1291, t1302, t1310, t1321, t1343, t1350, t1355, t1366, t1392, t1420, t1422, t1423, t1434, t1442, t1453, t1475, t1482, t1487, t1498, t1524, t1552, t1554, t1555, t1566, t1574, t1585, t1607, t1614, t1619, t162, t167, t1675, t1679, t1736, t1741, t1763, t1768, t178, t1843, t1848, t1873, t1878, t1953, t1958, t1983, t1988, t204, t2063, t2068, t2093, t2098, t2173, t2178, t2203, t2208, t2283, t2288, t2313, t2318, t232, t234, t235, t2393, t2398, t2423, t2428, t246, t2503, t2508, t2533, t2538, t254, t2613, t2618, t2643, t2648, t265, t2723, t2728, t2753, t2758, t2833, t2838, t2863, t2868, t287, t294, t2943, t2948, t2973, t2978, t299, t310, t336, t364, t366, t367, t378, t386, t397, t419, t426, t431, t442, t468, t496, t498, t499, t510, t518, t529, t551, t558, t563, t574, t600, t628, t630, t631, t642, t650, t661, t683, t690, t695, t706, t732, t760, t762, t763, t774, t782, t793, t815, t822, t827, t838, t864, t892, t894, t895, t906, t914, t925, t947, t954, t959, t970, t996, t_encoder_layer_0_attention_attention_key_weight, t_encoder_layer_0_attention_attention_query_weight, 
t_encoder_layer_0_attention_attention_value_weight, t_encoder_layer_0_attention_output_dense_weight, t_encoder_layer_0_intermediate_dense_weight, t_encoder_layer_0_layernorm_after_weight, t_encoder_layer_0_layernorm_before_weight, t_encoder_layer_0_output_dense_weight, t_encoder_layer_10_attention_attention_key_weight, t_encoder_layer_10_attention_attention_query_weight, t_encoder_layer_10_attention_attention_value_weight, t_encoder_layer_10_attention_output_dense_weight, t_encoder_layer_10_intermediate_dense_weight, t_encoder_layer_10_layernorm_after_weight, t_encoder_layer_10_layernorm_before_weight, t_encoder_layer_10_output_dense_weight, t_encoder_layer_11_attention_attention_key_weight, t_encoder_layer_11_attention_attention_query_weight, t_encoder_layer_11_attention_attention_value_weight, t_encoder_layer_11_attention_output_dense_weight, t_encoder_layer_11_intermediate_dense_weight, t_encoder_layer_11_layernorm_after_weight, t_encoder_layer_11_layernorm_before_weight, t_encoder_layer_11_output_dense_weight, t_encoder_layer_1_attention_attention_key_weight, t_encoder_layer_1_attention_attention_query_weight, t_encoder_layer_1_attention_attention_value_weight, t_encoder_layer_1_attention_output_dense_weight, t_encoder_layer_1_intermediate_dense_weight, t_encoder_layer_1_layernorm_after_weight, t_encoder_layer_1_layernorm_before_weight, t_encoder_layer_1_output_dense_weight, t_encoder_layer_2_attention_attention_key_weight, t_encoder_layer_2_attention_attention_query_weight, t_encoder_layer_2_attention_attention_value_weight, t_encoder_layer_2_attention_output_dense_weight, t_encoder_layer_2_intermediate_dense_weight, t_encoder_layer_2_layernorm_after_weight, t_encoder_layer_2_layernorm_before_weight, t_encoder_layer_2_output_dense_weight, t_encoder_layer_3_attention_attention_key_weight, t_encoder_layer_3_attention_attention_query_weight, t_encoder_layer_3_attention_attention_value_weight, t_encoder_layer_3_attention_output_dense_weight, t_encoder_layer_3_intermediate_dense_weight, t_encoder_layer_3_layernorm_after_weight, t_encoder_layer_3_layernorm_before_weight, t_encoder_layer_3_output_dense_weight, t_encoder_layer_4_attention_attention_key_weight, t_encoder_layer_4_attention_attention_query_weight, t_encoder_layer_4_attention_attention_value_weight, t_encoder_layer_4_attention_output_dense_weight, t_encoder_layer_4_intermediate_dense_weight, t_encoder_layer_4_layernorm_after_weight, t_encoder_layer_4_layernorm_before_weight, t_encoder_layer_4_output_dense_weight, t_encoder_layer_5_attention_attention_key_weight, t_encoder_layer_5_attention_attention_query_weight, t_encoder_layer_5_attention_attention_value_weight, t_encoder_layer_5_attention_output_dense_weight, t_encoder_layer_5_intermediate_dense_weight, t_encoder_layer_5_layernorm_after_weight, t_encoder_layer_5_layernorm_before_weight, t_encoder_layer_5_output_dense_weight, t_encoder_layer_6_attention_attention_key_weight, t_encoder_layer_6_attention_attention_query_weight, t_encoder_layer_6_attention_attention_value_weight, t_encoder_layer_6_attention_output_dense_weight, t_encoder_layer_6_intermediate_dense_weight, t_encoder_layer_6_layernorm_after_weight, t_encoder_layer_6_layernorm_before_weight, t_encoder_layer_6_output_dense_weight, t_encoder_layer_7_attention_attention_key_weight, t_encoder_layer_7_attention_attention_query_weight, t_encoder_layer_7_attention_attention_value_weight, t_encoder_layer_7_attention_output_dense_weight, t_encoder_layer_7_intermediate_dense_weight, t_encoder_layer_7_layernorm_after_weight, 
t_encoder_layer_7_layernorm_before_weight, t_encoder_layer_7_output_dense_weight, t_encoder_layer_8_attention_attention_key_weight, t_encoder_layer_8_attention_attention_query_weight, t_encoder_layer_8_attention_attention_value_weight, t_encoder_layer_8_attention_output_dense_weight, t_encoder_layer_8_intermediate_dense_weight, t_encoder_layer_8_layernorm_after_weight, t_encoder_layer_8_layernorm_before_weight, t_encoder_layer_8_output_dense_weight, t_encoder_layer_9_attention_attention_key_weight, t_encoder_layer_9_attention_attention_query_weight, t_encoder_layer_9_attention_attention_value_weight, t_encoder_layer_9_attention_output_dense_weight, t_encoder_layer_9_intermediate_dense_weight, t_encoder_layer_9_layernorm_after_weight, t_encoder_layer_9_layernorm_before_weight, t_encoder_layer_9_output_dense_weight, t_layernorm_weight, t_pooler_dense_weight, value_layer), ())
@t-vi (Collaborator) commented Dec 2, 2024

Hi, thank you for the detailed report!
The two examples have quite different significance to us:

  • The trivial example cannot really be expected to give a speedup; for such a tiny add, the per-call compilation and dispatch overhead dominates. We will work on the README to have a more relevant "hello world" example.
  • We do see performance issues with some PyTorch idioms used by HF transformers and are investigating them (see also the recent episodes of the Thunder Sessions). In the meantime, for text models, LitGPT gives better performance; I should really fix toroidal to use the built-in attention so it offers a great ViT implementation. :)

@t-vi t-vi changed the title Slower than torch.compile and raw pytorch on two examples. HF Transformers ViT slower than torch.compile and raw pytorch Dec 2, 2024
@tfogal tfogal removed their assignment Dec 13, 2024
@tfogal (Collaborator) commented Dec 13, 2024

I ran:

import timeit
import nvtx  # NVIDIA nvtx bindings, used for the profiling ranges below
import thunder
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("WinKawaks/vit-tiny-patch16-224").cuda()
jmodel = thunder.jit(model)
# first call triggers Thunder compilation (warmup)
jmodel(torch.randn((10, 3, 224, 224), device='cuda:0'))
cmodel = torch.compile(model)
# first call triggers torch.compile compilation (warmup)
cmodel(torch.randn((10, 3, 224, 224), device='cuda:0'))

# timeit needs a callable, so wrap each forward pass in a lambda; time 100 iterations per variant
mdl = timeit.timeit(lambda: model(torch.randn((10, 3, 224, 224), device='cuda:0')), number=100)
tc = timeit.timeit(lambda: cmodel(torch.randn((10, 3, 224, 224), device='cuda:0')), number=100)
th = timeit.timeit(lambda: jmodel(torch.randn((10, 3, 224, 224), device='cuda:0')), number=100)
print(f"timings: {mdl=}, {tc=}, {th=}")

# NVTX ranges so each variant shows up as a labeled region in nsys
nvtx.push_range("eager")
model(torch.randn(10, 3, 224, 224).cuda())
nvtx.pop_range()

nvtx.push_range("torch.compile alone")
cmodel(torch.randn(10, 3, 224, 224).cuda())
nvtx.pop_range()

nvtx.push_range("thunder")
jmodel(torch.randn(10, 3, 224, 224).cuda())
nvtx.pop_range()

On my ada6k I see: timings: mdl=6.309307054965757, tc=6.352527196984738, th=6.336398062005173. So, interestingly, thunder is faster than torch.compile but eager is faster than both.
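
(For anyone reproducing these numbers: below is a minimal sketch of a slightly more careful GPU timing loop that synchronizes before and after measuring, so queued CUDA work is actually included in each figure. The iteration count and the reuse of a single preallocated input are arbitrary choices for illustration, not what produced the timings above.)

import time
import torch

def bench(fn, iters=50):
    # warmup, then synchronize so pending kernels don't leak into the measurement
    for _ in range(3):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

x = torch.randn(10, 3, 224, 224, device="cuda")
print("eager        :", bench(lambda: model(x)))
print("torch.compile:", bench(lambda: cmodel(x)))
print("thunder      :", bench(lambda: jmodel(x)))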

Inside nsys, Thunder shows up as quite a bit slower:
[screenshot: nsys timeline comparing the "eager", "torch.compile alone", and "thunder" NVTX ranges]

In this particular case we're hit by significant host latency, as evidenced by the white gaps in the CUDA HW timeline. The most important near-term change we could make is to have the nvFuser or cuDNN-FE executors claim more ops, so that we're not constantly switching between executors in Thunder (which is slow).
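
(As a quick way to see this executor hand-off in a given run, you can print Thunder's final execution trace; a minimal sketch below, assuming jmodel has already been called at least once.)

import thunder

# After at least one forward pass, the final (post-transform) execution trace
# is available; fusion regions and plain torch.* calls indicate which executor
# ended up claiming each group of ops.
exec_trace = thunder.last_traces(jmodel)[-1]
print(exec_trace)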

The following issues detail more granular and direct fixes that are probably relevant to this issue:

With #1467 (or #981?) being the most relevant.

Assigning to Melissa so we can discuss and figure out next steps: do we want to keep this open as a tracking issue, or should we just defer to the related issues?

To the original poster:

  • Thanks for your detailed report! It turns out that both of these models are too small to benefit here. When we automagically CUDA-graph most/all things, it should be a different story.
  • Until then, you'll want to be careful to fill up the GPU by increasing batch sizes or sequence lengths or similar to see benefits.
  • Alternatively, you can try manually applying the CUDAGraphTransform via the transforms argument to thunder.jit (or thunderfx's ThunderCompiler); a sketch follows this list. This is not guaranteed to work in all cases, but when it works it should resolve this particular performance problem.
  • Side note: torch.randn(sz, device="cuda") creates the tensor directly on the GPU. torch.randn(sz).cuda() first creates a CPU tensor and then copies it to the GPU, which adds host-to-device transfer time to every call.
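
For reference, a minimal sketch of the CUDAGraphTransform suggestion above. The import path thunder.transforms.cudagraph is an assumption (it may differ by Thunder version), the input is created directly on the GPU per the side note, and this is a sketch rather than a guaranteed-working recipe:

import thunder
import torch
from transformers import AutoModel

# NOTE: import path is an assumption; CUDAGraphTransform may live elsewhere
# depending on your Thunder version.
from thunder.transforms.cudagraph import CUDAGraphTransform

model = AutoModel.from_pretrained("WinKawaks/vit-tiny-patch16-224").cuda()

# Apply the CUDA-graph transform at jit time to reduce per-launch host overhead.
jmodel = thunder.jit(model, transforms=[CUDAGraphTransform()])

# Create the input directly on the GPU (avoids the CPU->GPU copy noted above).
x = torch.randn(10, 3, 224, 224, device="cuda")
out = jmodel(x)  # first call compiles; subsequent calls should replay captured graphs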
