
Why does TensorBoard's Trace Viewer show blank waiting times? #2219

Open
nanzh-19 opened this issue Nov 19, 2024 · 6 comments
Labels
integration Issues with integrating the library into applications platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 question

Comments

@nanzh-19

I ran the program on an x86 machine using oneDNN as the backend library, and on an ARM machine using the default library. The TensorBoard profiling data shows blank waiting gaps on the ARM machine, while the computations on the x86 machine are continuous. I am currently optimizing performance on the ARM machine and would like to understand the reason for the blank times in TensorBoard's Trace Viewer on the ARM machine.

On the ARM machine, the TensorBoard trace shows a waiting gap before the two FusedMatmul operations.
[screenshot: ARM trace with waiting gaps]

On the x86 machine, the computations are continuous.
[screenshot: x86 trace with no gaps]

I would like to know the reason for the waiting gap and how to resolve it. One hypothesis is that oneDNN uses asynchronous computation, while the default library uses synchronous computation.

@vpirogov
Member

This question seems to be related to Tensorflow integration, not oneDNN itself, so you'll probably get more insight by asking on the Tensorflow forum.

+@milpuz01 in case he has some insights.

@vpirogov vpirogov added integration Issues with integrating the library into applications platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 labels Nov 19, 2024
@nanzh-19
Author

Do you know if oneDNN uses synchronous or asynchronous computation?

@Sqvid
Contributor

Sqvid commented Nov 19, 2024

Do you know if oneDNN uses synchronous or asynchronous computation?

I believe oneDNN defaults to synchronous computation but can use async mode when built against the SYCL backend and using a stream that has been initialised with the out_of_order flag.

@mgouicem could you confirm if this is accurate?
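For reference, the out-of-order mode mentioned above is chosen when the stream is constructed. A minimal sketch, assuming a oneDNN build with the SYCL GPU runtime (this will not compile against a CPU-only build, and the engine index is illustrative):

```cpp
#include <dnnl.hpp>  // requires oneDNN built with DNNL_GPU_RUNTIME=SYCL

int main() {
    // Engine index 0 is an assumption; pick the device you actually target.
    dnnl::engine eng(dnnl::engine::kind::gpu, 0);

    // Streams are in-order by default (primitives complete in submission
    // order). The out_of_order flag only has an effect on the SYCL/OpenCL
    // runtimes; CPU streams remain synchronous.
    dnnl::stream async_stream(eng, dnnl::stream::flags::out_of_order);

    // ... submit primitives, then wait for completion explicitly:
    async_stream.wait();
    return 0;
}
```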

@vpirogov
Member

In the context of Tensorflow, which seems to be the case here, computations below the oneDNN API are synchronous. This does not prevent Tensorflow from using oneDNN asynchronously, though. Hence I believe it's a Tensorflow question.
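The distinction above can be illustrated with a toy sketch (plain Python, not oneDNN's actual API): a kernel that is synchronous at its own boundary can still be dispatched asynchronously by the framework sitting above it.

```python
# Toy illustration: the "kernel" blocks until its result is ready (like a
# oneDNN primitive on a CPU stream), but the "framework" layer submits two
# independent ops to a thread pool and overlaps them.
from concurrent.futures import ThreadPoolExecutor

def matmul_kernel(a, b):
    # Synchronous at its own boundary: returns only when fully computed.
    return [[sum(x * y for x, y in zip(row, col))
             for col in zip(*b)] for row in a]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The framework drives two synchronous kernels concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(matmul_kernel, a, b)
    f2 = pool.submit(matmul_kernel, b, a)

print(f1.result())  # → [[19, 22], [43, 50]]
print(f2.result())  # → [[23, 34], [31, 46]]
```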

@nanzh-19
Author

Does asynchronous computation with oneDNN have higher efficiency than synchronous computation, since the thread waiting time would be shorter?

@mgouicem
Contributor

Hi all,

I believe oneDNN defaults to synchronous computation but can use async mode when built against the SYCL backend and using a stream that has been initialised with the out_of_order flag.
@mgouicem could you confirm if this is accurate?

Yes, all backends other than SYCL/OCL are synchronous by default, including the threadpool backend used by Tensorflow. Tensorflow, however, typically executes multiple ops concurrently, depending on the inter-op parallelism setting (these settings).
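For context, the inter-op setting mentioned above caps how many independent ops Tensorflow runs concurrently, while the intra-op setting caps the threads each op's kernel may use. A configuration sketch (assumes TensorFlow is installed; the values are examples, not recommendations, and must be set before any ops execute):

```python
import tensorflow as tf

# At most 2 independent ops scheduled concurrently.
tf.config.threading.set_inter_op_parallelism_threads(2)

# Each op's kernel may use at most 4 threads.
tf.config.threading.set_intra_op_parallelism_threads(4)
```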

Regarding the stalls you are seeing, they might well be caused by the threadpool implementation Tensorflow uses on the ARM platform, or by the default threading configuration used for each platform. As Vadim said, this is a Tensorflow question and would likely be better answered there. @milpuz01 @agramesh1 if you want to chime in.

Does asynchronous computation with oneDNN have higher efficiency than synchronous computation, since the thread waiting time would be shorter?

For GPU devices, asynchronous execution avoids blocking device execution on host-side kernel launches. For CPU devices, if each op does not use all the cores, it can allow independent computations to run concurrently on different sets of cores. It is a double-edged sword on CPU, though: if every op uses all the cores, concurrent execution creates resource contention and can lower performance.
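The CPU trade-off above can be sketched with a toy cost model (my own simplification, not a oneDNN benchmark): two independent ops on an N-core machine, assuming perfect linear scaling plus a penalty when an op spawns more threads than the cores available to it.

```python
# Toy model: time for one op given its thread count and the cores it owns.
def op_time(work, threads, cores, oversub_penalty=0.2):
    effective = min(threads, cores)
    t = work / effective
    if threads > cores:
        # Oversubscription: context switches and cache thrashing add overhead.
        t *= 1 + oversub_penalty * (threads / cores - 1)
    return t

N = 8         # cores on the machine
work = 80.0   # arbitrary work units per op

# A: run the two ops sequentially, each using all N cores.
sequential = 2 * op_time(work, N, N)

# B: run them concurrently, each pinned to N/2 cores (no contention).
concurrent_split = op_time(work, N // 2, N // 2)

# C: run them concurrently, but each op spawns N threads (contention).
concurrent_contended = op_time(work, N, N // 2)

print(sequential, concurrent_split, concurrent_contended)
# → 20.0 20.0 24.0  (contention makes the concurrent schedule slower)
```

In this model A and B tie, and C loses: concurrency only helps on CPU when ops are not already saturating the cores.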
