[WIP] calculate each reflection separately #79

Open · slizzered wants to merge 2 commits into dev from test-split_reflection_slices

Conversation

slizzered (Contributor)

This is a test to see whether we can split the calculation of a sample point
into multiple kernel calls (one per reflection slice). The reason is that our
current code computes all reflection slices in a single huge array. This old
style has several disadvantages that could be fixed:

  • the array numberOfReflectionSlices is huge: as big as
    indicesOfPrisms, so it is part of the bottleneck for the number
    of rays.
    • removing this array cuts our memory requirements almost by 50% (for
      really high ray numbers)
    • the kernel does not have to do the memory lookup to find the
      correct reflection_i (all reflections are in the same plane anyway)
  • the arrays numberOfReflectionSlices and raysPerPrism are
    actually linearized 2D arrays that contain all the reflection planes.
    This leads to more complicated code in mapRaysToPrisms (see the
    indexing sketch after this list).
    • if there is only one reflection plane at a time, this makes it much
      easier to split the numbers of rays even further to allow more rays
      in total (see Number of rays is limited to GPU memory #2).
    • the resulting code is more maintainable
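
To make the difference concrete, here is a minimal sketch of the indexing change; the function and parameter names below are illustrative assumptions, not the actual calcPhiASE kernel code:

```cuda
// Old layout: one huge launch over ALL reflection slices. Every ray needs a
// per-ray lookup to find its reflection slice, and raysPerPrism is a
// linearized 2D array [reflectionSlice][prism].
__device__ unsigned raysForPrismOld(const unsigned *numberOfReflectionSlices,
                                    const unsigned *raysPerPrism,
                                    unsigned ray_i,
                                    unsigned prism_i,
                                    unsigned numberOfPrisms) {
  unsigned reflection_i = numberOfReflectionSlices[ray_i];      // extra global memory read
  return raysPerPrism[reflection_i * numberOfPrisms + prism_i]; // linearized 2D index
}

// New layout: one launch per reflection slice. reflection_i becomes a kernel
// argument, the per-ray lookup array disappears, and raysPerPrism is a plain
// 1D array over the prisms of the current slice.
__device__ unsigned raysForPrismNew(const unsigned *raysPerPrismOfSlice,
                                    unsigned prism_i) {
  return raysPerPrismOfSlice[prism_i];
}
```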

This is nice and all, but splitting the reflections might introduce some
problems:

  • if there are reflections, we will do many more kernel calls, each of
    which might be quite small, so the GPU may not be utilized as well.
    Previously, everything was done in one huge kernel call.
  • since we don't know how many rays there are in each plane, we have to
    call thrust::reduce in each iteration.
  • since we need multiple ray schedules (one for each reflection plane),
    we also need to call mapRaysToPrisms in each iteration.

All in all, the performance implications need to be tested. I believe
that this commit can improve long-term code quality and will directly
enable #2. But if the performance suffers, we might need to code some
workaround (maybe use the split functionality only for really high ray
numbers where the tradeoff is not so bad and we really NEED it).
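
For illustration, here is a minimal host-side sketch of the per-slice structure described above, using thrust. All names and signatures are assumptions; only the overall loop shape reflects the proposed change:

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Sketch: one small launch per reflection slice instead of one huge launch.
// raysPerPrismAllSlices holds numberOfPrisms entries per slice (how it is
// produced is out of scope here).
void calcSamplePointPerSlice(unsigned reflectionSlices,
                             unsigned numberOfPrisms,
                             const thrust::device_vector<unsigned> &raysPerPrismAllSlices)
{
  for (unsigned reflection_i = 0; reflection_i < reflectionSlices; ++reflection_i) {
    // raysPerPrism of the current slice only (1D over prisms).
    auto sliceBegin = raysPerPrismAllSlices.begin() + reflection_i * numberOfPrisms;
    auto sliceEnd   = sliceBegin + numberOfPrisms;

    // The ray count of this slice is not known in advance, so we pay one
    // thrust::reduce per iteration.
    unsigned raysInSlice = thrust::reduce(sliceBegin, sliceEnd, 0u);
    if (raysInSlice == 0) continue; // nothing to do for this slice

    // One ray schedule per slice: indicesOfPrisms covers only this slice's
    // rays, and no numberOfReflectionSlices lookup array is needed anymore.
    thrust::device_vector<unsigned> indicesOfPrisms(raysInSlice);
    // mapRaysToPrisms(indicesOfPrisms, ...);                         // per-slice schedule
    // calcSampleKernel<<<blocks, threads>>>(..., reflection_i, ...); // one small launch
  }
}
```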

slizzered (Contributor, Author)

So I did some tests and throughput drops quite significantly for lower numbers of rays. I will follow this up with some profiling, maybe there is a way to optimize the overhead away (CUDA streams might be a solution if the GPU is not fully utilized).

Setup

Executed on Node Kepler002 in the Hypnos cluster

The C example was run as `./bin/calcPhiASE -c calcPhiASE.cfg --min-rays=X`, where X and the executable* vary between the runs. The config file is the one supplied with the example in the current `old` version, except for min-rays:

| minRays | runtime old [s] | runtime new [s] | throughput new/old |
| ------- | --------------- | --------------- | ------------------ |
| 10^5    | 137             | 224             | 0.61               |
| 10^6    | 448             | 531             | 0.84               |
| 10^7    | 3100**          | 3150**          | 0.98               |

* old is the current dev 2272f9b patched with 726b047
new is basically old but additionally patched with 0973d1a

** runtimes estimated after 10% of the simulation was completed. These times should be representative enough to get a good grasp of the performance implications.

slizzered force-pushed the test-split_reflection_slices branch from c674b4a to 42cf48b on July 9, 2015 at 13:57
slizzered (Contributor, Author)

Ok, so I did some refactoring and debugging, and the code got a lot faster. As an added benefit, it would now be trivial to add CUDA streams.

| minRays | runtime old [s] | runtime new [s] | throughput new/old |
| ------- | --------------- | --------------- | ------------------ |
| 10^5    | 137             | 190             | 0.72               |
| 10^6    | 448             | 476             | 0.94               |
| 10^7    | 3100**          | 3000**          | 1.03               |

* old is the current dev 2272f9b patched with 726b047
new is basically old but additionally patched with 0973d1a and 42cf48b

** runtimes estimated after 10% of the simulation was completed. These times should be representative enough to get a good grasp of the performance implications.

The second commit additionally:

 - re-uses all vectors as much as possible
 - creates one device-vector inside the loop (which seems wasteful at first) to enable an easy transition to CUDA streams in the future (see the sketch below) :)
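
A rough sketch of how that transition could look, with one CUDA stream per reflection slice; the kernel names and the surrounding structure are placeholders, not the actual code in this PR:

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <vector>

// Sketch: give every reflection slice its own stream so the many small
// per-slice launches can overlap. Because the per-slice device vector is
// created inside the loop, each slice owns its own buffer and nothing is
// shared between streams.
void calcSamplePointWithStreams(unsigned reflectionSlices)
{
  std::vector<cudaStream_t> streams(reflectionSlices);
  for (auto &s : streams) cudaStreamCreate(&s);

  for (unsigned reflection_i = 0; reflection_i < reflectionSlices; ++reflection_i) {
    cudaStream_t stream = streams[reflection_i];

    // Per-slice buffer, intentionally created inside the loop. In real code
    // it must outlive the asynchronous work enqueued on its stream.
    thrust::device_vector<unsigned> indicesOfPrisms;
    // indicesOfPrisms.resize(raysInSlice);

    // Per-slice work enqueued on this slice's stream, e.g.:
    // thrust::reduce(thrust::cuda::par.on(stream), sliceBegin, sliceEnd, 0u);
    // mapRaysToPrisms<<<blocks, threads, 0, stream>>>(...);
    // calcSampleKernel<<<blocks, threads, 0, stream>>>(..., reflection_i, ...);
  }

  for (auto &s : streams) {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }
}
```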