End-to-end inference isn't accelerated. #39
Comments
Unfortunately, ToMe is not magic. It can only speed up the total inference time if that time is bottlenecked by the model. If you aren't performing enough computation to actually saturate your graphics card, or if the eval has to wait on something else in your pipeline (e.g., data loading), then no model-based method can speed up your pipeline. That being said, as a sanity check, have you tried checking whether ToMe improves inference speed if you time just the model inference, not the whole pipeline? If ToMe properly reduces that time, then your pipeline is just constantly waiting on the dataloader. It doesn't matter how fast the model is (ViT-Ti, ResNet-50, or whatever): you'd get the same overall time because the dataloader can't load images fast enough.
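To make the bottleneck argument concrete, here is a rough back-of-the-envelope sketch. It assumes an idealized two-stage pipeline where data loading and model inference overlap perfectly; the function name `pipeline_time` and all per-batch timings are made up for illustration and are not measurements from this thread.

```python
def pipeline_time(num_batches, load_s, model_s, overlapped=True):
    """Estimate wall time for a two-stage pipeline (load -> model).

    load_s and model_s are hypothetical per-batch times in seconds.
    """
    if overlapped:
        # In steady state the pipeline advances at the pace of the slower
        # stage; the faster stage only adds a one-off cost at the start
        # (first load) or at the end (last forward pass).
        return num_batches * max(load_s, model_s) + min(load_s, model_s)
    # Without overlap, every batch pays for both stages in sequence.
    return num_batches * (load_s + model_s)


# Hypothetical numbers: loading takes 0.10 s per batch, the baseline model
# 0.08 s, and a model that is twice as fast 0.04 s.
baseline = pipeline_time(100, load_s=0.10, model_s=0.08)
faster = pipeline_time(100, load_s=0.10, model_s=0.04)
print(f"baseline: {baseline:.2f} s, faster model: {faster:.2f} s")
```

Under these assumed numbers, halving the model time changes the total by well under one percent, because the loader sets the pace. If loading were the cheaper stage, the same speedup would show up almost fully in the total.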
Thank you for your suggestions. I first validated that ToMe was installed correctly by using the examples/1_benchmark_timm.ipynb you provided, and I was able to measure the improvement in throughput. Back to the previous issue, I have broken down the complete E2E inference into three parts: I have the following questions:
I think you're misunderstanding how CUDA calls work. Most CUDA calls are asynchronous and thus return immediately, so timing the call itself does not measure the work it launches. In order to force the code to wait for CUDA operations to complete, you should do a
Hi, thanks for your excellent work!
I'm quite interested in your approach to speeding up ViT's throughput. However, when I run ViT-B end-to-end inference (including data input, preprocessing, and model inference), the processing time is the same whether I use ToMe or not. I even tried different batch_size values to fill the GPU memory, but the results are still the same.
Here's the result:
- device: each row uses one RTX 3090 GPU
- dataset: ImageNet-1k validation set
For every test case, I only change the model or batch_size. The other components (data input, preprocessing, and so on) are the same, as are the device and the code.
My question is: why is the "Total Inference Time" of the models with ToMe similar to the baseline (no ToMe)? Doesn't throughput measure the efficiency of model inference? Even if I didn't optimize the code for data input and preprocessing, the "Total Inference Time" should still be smaller than the baseline, because ToMe reduces the time spent in model inference.
Did I misunderstand something?