
Perform performance tests with updated GLADOS #29

Open
j-stephan opened this issue Dec 14, 2016 · 25 comments

@j-stephan
Contributor

As described in the (poetic) title

@BieberleA
Collaborator

Yes, of course! Next year will be fine. I just added this request to keep it in mind :-)

@BieberleA
Collaborator

Did you also change the stable version 0.2.0?

@j-stephan
Contributor Author

No need to. GLADOS was updated internally; the function call itself didn't change. The stable version thus automatically benefits from this change as well.

@j-stephan
Contributor Author

j-stephan commented Feb 6, 2017

I recently profiled PARIS on my laptop GPU; the backprojection kernel now accounts for 50% of the computation time. This is a major improvement compared to the 98% measured in November.

@BieberleA
Collaborator

Wow!!! What did you change? Is it related to the GPU performance difference? Is the processing time now faster (on our GTX cards)??? When will you next be at the HZDR?

@j-stephan
Contributor Author

Two things have changed since then:

  1. The GLADOS kernel launch routines have been improved thanks to Tobias' suggestions.
  2. The GLADOS pipelines have been removed; each "stage" now executes on its own stream.

I believe 1. to be the main source of the performance gain, as the GLADOS threading overhead shouldn't affect GPU execution time.

I'll try to come on Wednesday, depending on how early I can leave university. If that doesn't work out I'll be there Thursday.

@BieberleA
Collaborator

So projection data buffering becomes more important again?

@j-stephan
Contributor Author

Maybe; I'd have to profile wait times on the CPU to see that.

@BieberleA
Collaborator

Jan, I would like to analyze and profile the current code together with you (and also with Stephan and Tobias). Can we spend at least an hour on Wednesday? To be honest, I do not understand either of the two points mentioned! I always have Tobias's solution for RISA in mind - and there are obviously significant differences.

@tobiashuste
Member

The first point relates to CUDA's execution strategy:

A block is always executed in groups of 32 threads (called warps), even if there are not enough threads to fill a group. This is why the block size should always be a multiple of 32; otherwise some execution units on the GPU are guaranteed to go unused.
When a CUDA kernel was invoked through the function implemented in GLADOS, this GLADOS function computed a grid/block configuration. This computation did not result in a block size that was a multiple of 32 (it was a small one-digit number, as I remember). As a consequence, many execution units on the GPU were left unused, resulting in low utilization.
With the adjustment, a block size that is a multiple of 32 is guaranteed and the utilization increases. This explains the major performance improvement of the backprojection kernel.
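
For illustration, here is a minimal CUDA sketch of the kind of launch-configuration fix described above. The helper and kernel names are hypothetical stand-ins and are not taken from the GLADOS source.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: round a requested block size up to the next
// multiple of the warp size (32) so that no warp is launched half-empty.
inline unsigned int round_up_to_warp(unsigned int threads_wanted)
{
    constexpr unsigned int warp_size = 32u;
    return ((threads_wanted + warp_size - 1u) / warp_size) * warp_size; // e.g. 7 -> 32
}

__global__ void dummy_kernel(float* data, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.f;
}

int main()
{
    constexpr unsigned int n = 100000u;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // A one-digit block size (as apparently computed before the fix) leaves
    // most of every warp idle; rounding up keeps the warps fully populated.
    unsigned int block = round_up_to_warp(7u);     // -> 32
    unsigned int grid  = (n + block - 1u) / block; // enough blocks to cover n
    dummy_kernel<<<grid, block>>>(d_data, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```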

@BieberleA
Collaborator

Thanks for the explanations. My misunderstanding is more related to point 2. I became aware of it while watching the timing profiles of the two CUDA students and noticing many idle areas. Am I right, Jan?

@BieberleA
Collaborator

see CPT_2016_Extend-FDK/doc/pres_backhaus_stelzig.pdf

@j-stephan
Contributor Author

j-stephan commented Feb 6, 2017

Yes, there are idle areas. I believe* this issue has been resolved by eliminating the GLADOS pipeline from PARIS, as there is no more waiting between stages (i.e. weighting, filtering, backprojection). Previously, the projections had to be transferred from one stage to the next, which led to blocking if both stages wanted to access the shared buffer simultaneously.

In short: GPU kernel execution is a bit too fast for the host. Managing the GPU data (i.e. transferring it between stages) takes longer than processing that data, resulting in the gaps seen in the presentation. By eliminating the data transfer between stages, those gaps should disappear.

* I didn't have time to actually profile it; my laptop GPU apparently doesn't support this type of profiling.
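
To make the "each stage on its own stream" idea a bit more concrete, here is a minimal CUDA sketch of one way this can be wired up for a single projection. The kernel names (weight, filter, backproject) and the event-based ordering are placeholders and do not reflect the actual PARIS/GLADOS code.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels; the real PARIS weighting/filtering/backprojection
// kernels have different signatures.
__global__ void weight(float* p, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 0.5f;
}

__global__ void filter(float* p, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.f;
}

__global__ void backproject(float* p, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= p[i];
}

int main()
{
    constexpr unsigned int n = 1u << 20;

    float* d_proj = nullptr;
    cudaMalloc(&d_proj, n * sizeof(float));

    // One stream per stage instead of a host-side pipeline with shared buffers.
    cudaStream_t s_weight, s_filter, s_backproject;
    cudaStreamCreate(&s_weight);
    cudaStreamCreate(&s_filter);
    cudaStreamCreate(&s_backproject);

    // Events express the stage order on the GPU, so the host never
    // blocks on a shared buffer between stages.
    cudaEvent_t weighted, filtered;
    cudaEventCreate(&weighted);
    cudaEventCreate(&filtered);

    dim3 block{256u};
    dim3 grid{(n + block.x - 1u) / block.x};

    weight<<<grid, block, 0, s_weight>>>(d_proj, n);
    cudaEventRecord(weighted, s_weight);

    cudaStreamWaitEvent(s_filter, weighted, 0);
    filter<<<grid, block, 0, s_filter>>>(d_proj, n);
    cudaEventRecord(filtered, s_filter);

    cudaStreamWaitEvent(s_backproject, filtered, 0);
    backproject<<<grid, block, 0, s_backproject>>>(d_proj, n);

    cudaDeviceSynchronize();

    cudaEventDestroy(weighted);
    cudaEventDestroy(filtered);
    cudaStreamDestroy(s_weight);
    cudaStreamDestroy(s_filter);
    cudaStreamDestroy(s_backproject);
    cudaFree(d_proj);
    return 0;
}
```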

@j-stephan
Contributor Author

As for Wednesday: I can't promise I'll actually make it to the HZDR. If we want to profile with four people present, we should rather do it on Thursday.

@BieberleA
Collaborator

I really do not understand why GLADOS has/had a pipeline??? Many stages can be stacked together into a data pipeline, can't they?

@BieberleA
Collaborator

What is the difference between 1 and 4 people???

@j-stephan
Contributor Author

GLADOS pipelines are not useless in general. However, most stages in PARIS execute so fast that the host management routines actually take longer than the GPU kernels. In this special case the GLADOS pipeline doesn't offer any benefit.

In all other cases the pipeline pattern is still useful. For the sake of an example, let us imagine that we want to execute the backprojection kernel five times in a row (independently) for a number of volumes. In this case, GPU execution per stage would consume considerably more time than the overhead introduced by the GLADOS pipeline, effectively masking it.

Why did I build the pipeline structure? Because when I first came up with the idea, I didn't know the execution times per stage; I believed them to be roughly equal. If that had been the case, the GLADOS pipeline would still be useful. However, the backprojection consumes most of the time, resulting in busy-waiting in the other stages.
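
As a rough illustration of the masking argument, the sketch below launches a long-running stand-in for the backprojection kernel asynchronously for several volumes while the host does some per-volume bookkeeping. The kernel and the simulated host work are hypothetical; the point is only the timing relationship: as long as each kernel runs longer than the host-side management, that overhead disappears behind the GPU work.

```cuda
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Toy stand-in for a long-running backprojection kernel.
__global__ void backproject(float* vol, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 256; ++k)      // artificial work
            vol[i] = vol[i] * 1.0001f + 0.0001f;
}

int main()
{
    constexpr unsigned int n = 1u << 22;
    constexpr int num_volumes = 5;

    float* d_vol[num_volumes];
    cudaStream_t streams[num_volumes];
    for (int v = 0; v < num_volumes; ++v)
    {
        cudaMalloc(&d_vol[v], n * sizeof(float));
        cudaStreamCreate(&streams[v]);
    }

    dim3 block{256u};
    dim3 grid{(n + block.x - 1u) / block.x};

    for (int v = 0; v < num_volumes; ++v)
    {
        // The launch is asynchronous, so the host can do its per-volume
        // management (faked here with a sleep) while earlier launches are
        // still running. If the kernel takes longer than this host work,
        // the management overhead is completely hidden.
        backproject<<<grid, block, 0, streams[v]>>>(d_vol[v], n);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

    cudaDeviceSynchronize();
    for (int v = 0; v < num_volumes; ++v)
    {
        cudaFree(d_vol[v]);
        cudaStreamDestroy(streams[v]);
    }
    return 0;
}
```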

@j-stephan
Contributor Author

If four people are to be present, we have a meeting. If we have a meeting, we have to fix a date - which I can't guarantee for Wednesday.

@BieberleA
Collaborator

Okay ... then Thursday at 14:00?

@j-stephan
Contributor Author

Fine by me.

@BieberleA
Collaborator

Tobias, Stephan, Micha: fine for you?

@MichaWagner

ok

@tobiashuste
Member

Ok!

@BieberleA
Collaborator

Very well. So Jan, could you please prepare information on both points mentioned? Additionally, could you please provide a timing schedule (profiling) for, let's say, 2-4 projections being processed and backprojected into a "full-size" volume? As an example, the profile Tobias presented in his diploma defense and in the CPC paper shall be used as a reference (profiler.pdf). Thx.

@BodenS

BodenS commented Feb 7, 2017 via email
