Why does TensorBoard's Trace Viewer show blank waiting times? #2219
Comments
This question seems to be related to the TensorFlow integration, not oneDNN itself, so you'll probably get more insight by asking on the TensorFlow forum. +@milpuz01 in case he has some insights.
Do you know if oneDNN uses synchronous or asynchronous computation?
I believe oneDNN defaults to synchronous computation, but it can use async mode when built against the SYCL backend and used with a stream that has been initialised with the appropriate flags. @mgouicem, could you confirm whether this is accurate?
In the context of TensorFlow, which seems to be the case here, computations below the oneDNN API are synchronous. This does not prevent TensorFlow from using oneDNN asynchronously, though. Hence I believe it's a TensorFlow question.
Does asynchronous computation with oneDNN have higher efficiency than synchronous computation, since the thread waiting time is shorter?
Hi all,
Yes, all backends other than SYCL/OCL are synchronous by default, including the threadpool backend used by TensorFlow. However, TensorFlow typically executes multiple ops concurrently based on the inter-op parallelism setting (these settings). Regarding the stalls you are seeing, they might well be caused by the threadpool implementation used in TensorFlow for the ARM platform, or by the default threading configuration used for each platform. So, as Vadim said, this is a TensorFlow question and would likely be better answered there. @milpuz01 @agramesh1 if you want to chime in.
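For reference, a minimal sketch of the inter-op/intra-op knobs mentioned above. TensorFlow reads these environment variables at startup; the values below are illustrative only, not recommendations for any particular machine:

```shell
# Illustrative sketch: tune TensorFlow threading before launching the process.
# Values are placeholders; pick them based on your core count and workload.
export TF_NUM_INTEROP_THREADS=4    # how many independent ops may run concurrently
export TF_NUM_INTRAOP_THREADS=16   # threads available inside each op's kernel
```

The same settings can be applied programmatically via `tf.config.threading.set_inter_op_parallelism_threads` and `tf.config.threading.set_intra_op_parallelism_threads` before any op runs.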
For GPU devices, asynchronous behavior keeps device execution from being blocked by host-side kernel launches. For CPU devices, if each op does not use all the cores, it can allow independent computations to run concurrently on different sets of cores. It is a double-edged sword on CPU, though: if each op uses all the cores, concurrent ops create resource contention and can lower performance.
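As a toy illustration of the inter-op idea above (plain Python, not TensorFlow internals): two independent "ops" submitted to a small worker pool can run concurrently when each uses only part of the machine, while shrinking the pool to one worker serializes them, analogous to ops contending for the same cores:

```python
from concurrent.futures import ThreadPoolExecutor

# Two independent, hypothetical "ops" with no data dependency between them.
def op_a(n):
    return sum(i * i for i in range(n))

def op_b(n):
    return sum(3 * i for i in range(n))

# A pool of 2 workers mirrors an inter-op parallelism setting of 2:
# both ops can be in flight at once. With max_workers=1 they would
# run back to back instead, like a fully synchronous schedule.
with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(op_a, 1000)
    fb = pool.submit(op_b, 1000)
    result_a, result_b = fa.result(), fb.result()
```

This only models scheduling, not the contention effect: in the real system, each op's kernel may itself be multi-threaded (the intra-op pool), which is where oversubscription can hurt.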
I ran the program on an x86 machine using oneDNN as the backend library and on an ARM machine using the default library. The TensorBoard profiling data shows blank waiting times on the ARM machine, while the computations on the x86 machine are very continuous. I am currently optimizing performance on the ARM machine and would like to understand the cause of the blank gaps in TensorBoard's Trace Viewer on the ARM machine.
The TensorBoard data on the ARM machine shows a waiting gap before the two FusedMatmul operations.
computations on the x86 machine are very continuous
I would like to know the reason for the waiting gap and how to resolve it. One hypothesis is that oneDNN uses asynchronous computation, while the default library uses synchronous computation.