
Perform performance tests with updated GLADOS #29

Open
j-stephan opened this issue Dec 14, 2016 · 25 comments

@j-stephan
Contributor

As described in the (poetic) title

@BieberleA
Collaborator

Yes, of course! Next year will be fine. I just added this request to keep it in mind :-)

@BieberleA
Collaborator

Did you also change the stable version 0.2.0?

@j-stephan
Contributor Author

No need to. GLADOS was updated internally; the function call itself didn't change. The stable version thus automatically benefits from this change as well.

@j-stephan
Contributor Author

j-stephan commented Feb 6, 2017

I recently profiled PARIS on my laptop GPU; the backprojection kernel now accounts for 50% of the computation time. This is a major improvement compared to the 98% measured in November.

@BieberleA
Collaborator

Wow!!! What did you change? Is it related to the GPU performance difference? Is the processing time now faster (on our GTX cards)??? When will you next be at the HZDR?

@j-stephan
Contributor Author

Two things have changed since then:

  1. The GLADOS kernel launch routines have been improved thanks to Tobias' suggestions.
  2. The GLADOS pipelines have been removed; each "stage" now executes on its own stream.

I believe 1. to be the main source of the performance gain, as the GLADOS threading overhead shouldn't affect GPU execution time.

I'll try to come on Wednesday, depending on how early I can leave university. If that doesn't work out I'll be there Thursday.

@BieberleA
Collaborator

So projection data buffering becomes more important again?

@j-stephan
Contributor Author

Maybe; I'd have to profile wait times on the CPU to see that.

@BieberleA
Collaborator

Jan, I would like to analyze and profile the current code together with you (and also with Stephan and Tobias). Can we spend at least an hour on Wednesday? To be honest, I do not understand either of the two points mentioned! I always have Tobias's solution for RISA in mind - and there are obviously significant differences.

@tobiashuste
Member

The first point relates to CUDA's execution strategy:

A block is always executed in groups of 32 threads (called warps), even if there are not enough threads to fill a group. This is why the block size should always be a multiple of 32; otherwise some execution units on the GPU are guaranteed to go unused.
When a CUDA kernel was invoked through the function implemented in GLADOS, this GLADOS function computed a grid/block configuration. This computation did not result in a block size that was a multiple of 32 (it was a small one-digit number, as I remember). As a consequence, many execution units on the GPU were left unused, resulting in low utilization.
With the adjustment, a block size that is a multiple of 32 is guaranteed and the utilization increases. This explains the major performance improvement of the backprojection kernel.
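
For illustration, here is a minimal CUDA sketch of the kind of launch-configuration fix described above. The helper and kernel names are hypothetical stand-ins and are not taken from the GLADOS source.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: round a requested block size up to the next
// multiple of the warp size (32) so that no warp is launched half-empty.
inline unsigned int round_up_to_warp(unsigned int threads_wanted)
{
    constexpr unsigned int warp_size = 32u;
    return ((threads_wanted + warp_size - 1u) / warp_size) * warp_size; // e.g. 7 -> 32
}

__global__ void dummy_kernel(float* data, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.f;
}

int main()
{
    constexpr unsigned int n = 100000u;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // A one-digit block size (as apparently computed before the fix) leaves
    // most of every warp idle; rounding up keeps the warps fully populated.
    unsigned int block = round_up_to_warp(7u);     // -> 32
    unsigned int grid  = (n + block - 1u) / block; // enough blocks to cover n
    dummy_kernel<<<grid, block>>>(d_data, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```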

@BieberleA
Collaborator

Thanks for the explanations. My misunderstanding is more related to point 2. I became aware of it while watching the timing profiles of the two CUDA students and noticing many idle areas. Am I right, Jan?

@BieberleA
Collaborator

see CPT_2016_Extend-FDK/doc/pres_backhaus_stelzig.pdf

@j-stephan
Contributor Author

j-stephan commented Feb 6, 2017

Yes, there are idle areas. I believe* this issue has been resolved by eliminating the GLADOS pipeline from PARIS, as there is no more waiting between stages (i.e. weighting, filtering, backprojection). Previously, the projections had to be transferred from one stage to the next, which led to blocking if both stages wanted to access the shared buffer simultaneously.

In short: GPU kernel execution is a bit too fast for the host. Managing the GPU data (i.e. transferring it between stages) takes longer than processing that data, resulting in the gaps seen in the presentation. By eliminating the data transfer between stages, those gaps should disappear.

* I didn't have time to actually profile it; my laptop GPU apparently doesn't support this type of profiling.
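
To make the "each stage on its own stream" idea a bit more concrete, here is a minimal CUDA sketch of one way this can be wired up for a single projection. The kernel names (weight, filter, backproject) and the event-based ordering are placeholders and do not reflect the actual PARIS/GLADOS code.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels; the real PARIS weighting/filtering/backprojection
// kernels have different signatures.
__global__ void weight(float* p, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 0.5f;
}

__global__ void filter(float* p, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.f;
}

__global__ void backproject(float* p, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= p[i];
}

int main()
{
    constexpr unsigned int n = 1u << 20;

    float* d_proj = nullptr;
    cudaMalloc(&d_proj, n * sizeof(float));

    // One stream per stage instead of a host-side pipeline with shared buffers.
    cudaStream_t s_weight, s_filter, s_backproject;
    cudaStreamCreate(&s_weight);
    cudaStreamCreate(&s_filter);
    cudaStreamCreate(&s_backproject);

    // Events express the stage order on the GPU, so the host never
    // blocks on a shared buffer between stages.
    cudaEvent_t weighted, filtered;
    cudaEventCreate(&weighted);
    cudaEventCreate(&filtered);

    dim3 block{256u};
    dim3 grid{(n + block.x - 1u) / block.x};

    weight<<<grid, block, 0, s_weight>>>(d_proj, n);
    cudaEventRecord(weighted, s_weight);

    cudaStreamWaitEvent(s_filter, weighted, 0);
    filter<<<grid, block, 0, s_filter>>>(d_proj, n);
    cudaEventRecord(filtered, s_filter);

    cudaStreamWaitEvent(s_backproject, filtered, 0);
    backproject<<<grid, block, 0, s_backproject>>>(d_proj, n);

    cudaDeviceSynchronize();

    cudaEventDestroy(weighted);
    cudaEventDestroy(filtered);
    cudaStreamDestroy(s_weight);
    cudaStreamDestroy(s_filter);
    cudaStreamDestroy(s_backproject);
    cudaFree(d_proj);
    return 0;
}
```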

@j-stephan
Contributor Author

As for Wednesday: I can't promise I'll actually make it to the HZDR. If we want to profile with four people present, we should rather do it on Thursday.

@BieberleA
Collaborator

I really do not understand why GLADOS has/had a pipeline??? Many stages can be stacked together into a data pipeline, can't they?

@BieberleA
Collaborator

What is the difference between 1 and 4 people???

@j-stephan
Contributor Author

GLADOS pipelines are not useless in general. However, most stages in PARIS execute so fast that the host management routines actually take longer than the GPU kernels. In this special case the GLADOS pipeline doesn't offer any benefit.

In all other cases the pipeline pattern is still useful. For the sake of an example, let us imagine that we want to execute the backprojection kernel five times in a row (independently) for a number of volumes. In this case, GPU execution per stage would consume considerably more time than the overhead introduced by the GLADOS pipeline, effectively masking it.

Why did I build the pipeline structure? Because when I first came up with the idea, I didn't know the execution times per stage; I believed them to be roughly equal. If that had been the case, the GLADOS pipeline would still be useful. However, the backprojection consumes most of the time, resulting in busy-waiting in the other stages.
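
As a rough illustration of the masking argument, the sketch below launches a long-running stand-in for the backprojection kernel asynchronously for several volumes while the host does some per-volume bookkeeping. The kernel and the simulated host work are hypothetical; the point is only the timing relationship: as long as each kernel runs longer than the host-side management, that overhead disappears behind the GPU work.

```cuda
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Toy stand-in for a long-running backprojection kernel.
__global__ void backproject(float* vol, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 256; ++k)      // artificial work
            vol[i] = vol[i] * 1.0001f + 0.0001f;
}

int main()
{
    constexpr unsigned int n = 1u << 22;
    constexpr int num_volumes = 5;

    float* d_vol[num_volumes];
    cudaStream_t streams[num_volumes];
    for (int v = 0; v < num_volumes; ++v)
    {
        cudaMalloc(&d_vol[v], n * sizeof(float));
        cudaStreamCreate(&streams[v]);
    }

    dim3 block{256u};
    dim3 grid{(n + block.x - 1u) / block.x};

    for (int v = 0; v < num_volumes; ++v)
    {
        // The launch is asynchronous, so the host can do its per-volume
        // management (faked here with a sleep) while earlier launches are
        // still running. If the kernel takes longer than this host work,
        // the management overhead is completely hidden.
        backproject<<<grid, block, 0, streams[v]>>>(d_vol[v], n);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

    cudaDeviceSynchronize();
    for (int v = 0; v < num_volumes; ++v)
    {
        cudaFree(d_vol[v]);
        cudaStreamDestroy(streams[v]);
    }
    return 0;
}
```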

@j-stephan
Contributor Author

If four people are to be present, we have a meeting. If we have a meeting, we have to fix a date - which I can't guarantee for Wednesday.

@BieberleA
Collaborator

Okay ... then Thursday at 14:00?

@j-stephan
Contributor Author

Fine by me.

@BieberleA
Collaborator

Tobias, Stephan, Micha: fine for you?

@MichaWagner

ok

@tobiashuste
Member

Ok!

@BieberleA
Collaborator

Very well. So Jan, could you please prepare information on both points mentioned? Additionally, could you please provide a timing schedule (profiling) for, let's say, 2-4 projections being processed and backprojected into a "full-size" volume? As an example, the profile Tobias presented in his diploma defense and in the CPC paper shall be used as a reference (profiler.pdf). Thx.

@BodenS

BodenS commented Feb 7, 2017 via email
