End-to-end inference doesn't be accelerated. #39

Open
rwshihhh opened this issue May 10, 2024 · 3 comments

Comments

@rwshihhh

Hi, thanks for your excellent work!
I'm quite interested in your approach to speeding up ViT throughput. However, when I run ViT-B end-to-end inference (data input, preprocessing, and model inference), the processing time is the same with and without ToMe. I even tried different batch_size values to fill the GPU memory, but the results are still the same.
Here's the result:
- device: each row uses one RTX 3090 GPU
- dataset: ImageNet-1k validation set
[Image: end-to-end_result]

For every test case, I change only the model or batch_size. The other components (data input, preprocessing, and so on) are the same, as are the device and the code.

My question is: why is the "Total Inference Time" of the models with ToMe similar to the baseline (no ToMe)? Doesn't throughput measure the efficiency of model inference? Even though I didn't optimize the code for data input and preprocessing, the "Total Inference Time" should still be smaller than the baseline, because ToMe reduces the time spent in model inference.
Did I misunderstand something?

@dbolya
Contributor

dbolya commented May 10, 2024

Unfortunately, ToMe is not magic. It can only speed up the total inference time if that time is bottlenecked by the model. So if you aren't performing enough computation to actually saturate your graphics card, or if the eval has to wait on something else in your pipeline (e.g., dataloading), then no model-based method can speed up your pipeline.

That being said, have you tried checking if ToMe improves inference speed if you just time inference, not the whole pipeline? As a sanity check.

If ToMe does speed up inference when timed on its own, then what that means is your pipeline is just constantly waiting on the dataloader. It doesn't matter how fast the model is (ViT-Ti or ResNet-50 or whatever); you'd get the same overall time because the dataloader can't load images fast enough.
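To make the bottleneck argument concrete, here is a rough back-of-the-envelope sketch. It assumes data loading and model inference overlap perfectly (e.g., a prefetching dataloader), so the pipeline runs at the rate of its slowest stage. The function names and all the images/sec figures are made up for illustration; they are not measurements from this issue.

```python
def pipeline_throughput(loader_ips: float, model_ips: float) -> float:
    """Images/sec of a pipeline whose loader and model run concurrently.

    Simplifying assumption: the two stages overlap fully, so the slower
    stage sets the overall rate.
    """
    return min(loader_ips, model_ips)


def end_to_end_speedup(loader_ips: float, base_model_ips: float,
                       fast_model_ips: float) -> float:
    """End-to-end speedup from swapping in a faster model."""
    before = pipeline_throughput(loader_ips, base_model_ips)
    after = pipeline_throughput(loader_ips, fast_model_ips)
    return after / before


# Hypothetical numbers: the faster model doubles model-only throughput.
slow_loader = end_to_end_speedup(loader_ips=500, base_model_ips=800, fast_model_ips=1600)
fast_loader = end_to_end_speedup(loader_ips=5000, base_model_ips=800, fast_model_ips=1600)
print(slow_loader)  # loader-bound: no end-to-end gain
print(fast_loader)  # model-bound: the full model speedup shows up
```

In the first case the loader caps both runs at 500 images/sec, so the model-only gain never reaches the total time. Real pipelines overlap less cleanly than this, so treat it as an upper-level intuition rather than a prediction.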

@rwshihhh
Author

Thank you for your suggestions. I first validated that ToMe was installed correctly by using the examples/1_benchmark_timm.ipynb you provided, and I was able to measure the improvement in throughput.

Back to the previous issue, I have broken down the complete E2E inference into three parts:
Part 1. load data from disk to DRAM to GPU. (Including data preprocessing) -> Variable in the code: count_load_whole
Part 2. model inference, e.g., code's model(input) -> Variable in the code: count_model
Part 3. remaining parts, e.g., calculate label (inference accuracy) -> Variable in the code: count_label_cal
[Image: 000]

I have the following questions:

  1. When you mention that ToMe improves inference speed, are you referring to throughput (images/sec)? If so, does it relate to Part 2 of my E2E inference code? If that's the case, why does using ToMe make Part 2 take longer? For instance, when r=0, Part 2 takes 7.4 seconds, but when r=13, it takes 12.4 seconds.

  2. Since ToMe inserts a bipartite matching step between ViT's attention and MLP, shouldn't it affect only the model's architecture and its processing time? If so, why does using ToMe also change the processing time of Part 1 and Part 3? For example, a higher r value reduces the time consumed by Part 3 and increases the time consumed by Part 1.

@dbolya
Contributor

dbolya commented May 13, 2024

I think you're misunderstanding how CUDA calls work. Most CUDA calls are asynchronous and return immediately, so timing the call itself mostly measures how long it took to queue the work, not how long the work took. To force the code to wait for CUDA operations to complete, you should call torch.cuda.synchronize() every time before you sample the current time.
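As a minimal sketch of what that could look like: the helper below synchronizes before starting the clock (so earlier queued work isn't billed to this section) and again before stopping it (so the work launched inside the section has actually finished). `SectionTimer` and `cuda_sync` are hypothetical names made up for this example; only `torch.cuda.is_available()` and `torch.cuda.synchronize()` are real PyTorch calls. The import is guarded so the helper can be read and tried without PyTorch installed.

```python
import time

try:
    import torch
except ImportError:  # allows trying the helper without PyTorch
    torch = None


def cuda_sync():
    """Block until all queued CUDA work has finished (no-op without a GPU)."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


class SectionTimer:
    """Accumulates wall-clock seconds per named section, syncing at both ends."""

    def __init__(self, sync=cuda_sync):
        self.sync = sync
        self.totals = {}

    def measure(self, name, fn, *args, **kwargs):
        self.sync()  # drain earlier async work so it is not charged to this section
        start = time.perf_counter()
        out = fn(*args, **kwargs)
        self.sync()  # wait for the kernels launched by fn to complete
        elapsed = time.perf_counter() - start
        self.totals[name] = self.totals.get(name, 0.0) + elapsed
        return out


# Hypothetical usage inside an eval loop (model and loader are assumed to exist):
#   timer = SectionTimer()
#   for images, labels in loader:
#       images = timer.measure("count_load_whole", images.cuda)
#       logits = timer.measure("count_model", model, images)
#       preds = timer.measure("count_label_cal", logits.argmax, dim=1)
#   print(timer.totals)
```

Without the synchronize calls, the time for the model's kernels tends to get charged to whichever later line first has to wait on the GPU (for example, copying results back to the CPU), which is one plausible reason the per-part numbers shift around when r changes.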
