performance tuning and disabling deprecated backends #525
Conversation
Codecov Report — patch coverage, additional details and impacted files:

@@ Coverage Diff @@
##           master     #525      +/-   ##
==========================================
+ Coverage   90.37%   90.70%   +0.33%
==========================================
  Files          35       34       -1
  Lines        4456     4292     -164
==========================================
- Hits         4027     3893     -134
+ Misses        429      399      -30
And for some reason the Windows TBB build is now working, nice!
This looks amazing! I'll update our list of required tests.
else
  // https://github.com/scandum/binary_search/blob/master/README.md
  // much faster than standard binary search on large arrays
  int monobound_quaternary_search(const int64_t *array, unsigned int array_size,
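The linked monobound technique can be sketched in its simpler binary form. This is an illustrative adaptation of the idea in the scandum README, not the PR's quaternary implementation: the number of loop iterations depends only on the array size, never on comparison outcomes, which keeps the loop structure predictable for the branch predictor.

```cpp
#include <cstdint>

// Monobound binary search sketch (adapted from scandum/binary_search).
// Returns the index of key in the sorted array, or -1 if absent.
int monobound_binary_search(const int64_t *array, unsigned int array_size,
                            int64_t key) {
  if (array_size == 0) return -1;
  unsigned int bot = 0;
  unsigned int top = array_size;
  while (top > 1) {
    unsigned int mid = top / 2;
    // the comparison is data-dependent, but the iteration count is
    // fixed by array_size alone
    if (key >= array[bot + mid]) bot += mid;
    top -= mid;
  }
  return array[bot] == key ? static_cast<int>(bot) : -1;
}
```

The quaternary variant used in the PR splits the range four ways per iteration instead of two, trading a few extra comparisons for fewer memory round-trips on large arrays.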
Fancy!
@@ -185,7 +186,8 @@ struct CoplanarEdge {
  const int numProp;
  const float precision;

  __host__ __device__ void operator()(
  // FIXME: race condition
Is this just non-determinism, or can it be worse?
In theory it can be worse, as this is UB. In practice it may or may not cause non-determinism, depending on whether the values being assigned differ.
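A minimal illustration of that distinction (hypothetical code, not the project's kernel): two unsynchronized threads storing to the same plain `int` is a data race and therefore UB even when both store the same value. Routing the store through `std::atomic` makes it well-defined; the result only looks deterministic because the stored values happen to match.

```cpp
#include <atomic>
#include <thread>

// Two threads store the same value to one location. With std::atomic this
// is well-defined; with a plain int it would be UB even though the final
// value would "usually" come out the same.
int racy_looking_but_defined() {
  std::atomic<int> slot{0};
  auto writer = [&slot] { slot.store(42, std::memory_order_relaxed); };
  std::thread a(writer), b(writer);
  a.join();
  b.join();
  return slot.load(std::memory_order_relaxed);
}
```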
I will just put this into a new issue
#else
#define TracyAllocS(ptr, size, n) (void)0
#define TracyFreeS(ptr, n) (void)0
#endif
The whole point of VecDH (at least originally) was to handle the fact that CUDA needs separate host and device vectors that need to be synced. With CUDA now removed, is there a chance we can remove VecDH entirely? Doesn't have to be in this PR, but it would be great if we can simplify here.
Yes, this is what I am going to do, but I plan to do it in another PR, as it will involve a lot of changes.
Thinking about it, I wonder whether we may want Vulkan acceleration later. Vulkan would require something similar, or perhaps a bit more complicated (as we may not have unified memory, depending on the target hardware).
Also, since our VecDH class has specialized parallel-fill and uninitialized-memory routines, changing to std::vector<T> can result in more than a 20% performance regression for some test cases.
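One reason a custom container can beat `std::vector` for trivially-constructible types: `std::vector::resize` value-initializes (zeroes) every new element, while a custom vector can leave them untouched. A hedged sketch of that idea follows; the names here are hypothetical, not manifold's actual VecDH API, and copying is deliberately not handled.

```cpp
#include <cstddef>
#include <cstdlib>
#include <type_traits>

// Minimal grow-only buffer that skips element initialization entirely,
// which is safe only for trivially copyable types.
template <typename T>
class UninitVec {
  static_assert(std::is_trivially_copyable<T>::value,
                "sketch: trivial types only");
  T *buf_ = nullptr;
  size_t size_ = 0;

 public:
  UninitVec() = default;
  UninitVec(const UninitVec &) = delete;  // copying not handled in this sketch
  ~UninitVec() { std::free(buf_); }

  // Grows storage without writing to the new elements' bytes, unlike
  // std::vector::resize, which value-initializes each one.
  void resize_nofill(size_t n) {
    buf_ = static_cast<T *>(std::realloc(buf_, n * sizeof(T)));
    size_ = n;
  }
  size_t size() const { return size_; }
  T &operator[](size_t i) { return buf_[i]; }
};
```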
I have no problem with leaving the class if it's actually doing something useful, but if it can be simplified to one internal storage instead of two, that would be great. We can always redesign it later if needed for Vulkan, but I don't want to plan early for that, especially considering that TBB ended up being just as fast as CUDA. That makes me think Vulkan may be more trouble than it's worth.
Yeah, I can simplify that in another PR. It was really surprising to see that switching to std::vector caused such a large slowdown.
Regarding Vulkan, I think the only case where GPU can be faster is SDF.
Good to know. Maybe someday SDF will be used enough that it'll be worth looking into, but for now I'm thinking TBB is just fine.
Nice job on the benchmarks! We should really add our spheres perf benchmark to our standard tests so we remember to run it. I consider it more indicative of common real-world usage and scaling than these degenerate-focused tests.
Agreed. I actually checked the spheres benchmark as well, but I forgot to include it in the plots.
I also did some optimization for csg_tree today; I think the result is noticeable. However, a weird thing is that after changing the SparseIndices API the tests in general run a bit slower. I suspect this is due to pointer arithmetic not being optimized enough by GCC (or not inlined enough, as we originally used pointers directly but now go through methods).
This is excellent! A few nits, but I think this is ready when you are.
};

// common cases
if (results.size() == 0) return std::make_shared<Manifold::Impl>();
if (results.size() == 1)
💯
src/utilities/CMakeLists.txt (outdated)

endif()

if(EMSCRIPTEN)
  set(TBB_GIT_TAG 8db4efacad3b8aa6eea62281a9e8444e5dc8f16a)
What's this?
Oh, this is the TBB-on-wasm thing that I accidentally committed; I'll remove it.
@@ -26,7 +26,7 @@ typedef Uint64 (*hash_fun_t)(Uint64);
constexpr Uint64 kOpen = std::numeric_limits<Uint64>::max();

template <typename T>
-__host__ __device__ T AtomicCAS(T& target, T compare, T val) {
+T AtomicCAS(T& target, T compare, T val) {
Looks like some more __CUDA_ARCH__
below here we can remove.
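With the CUDA path gone, a host-only AtomicCAS matching the signature in the diff above can be expressed through `std::atomic`. This is a hedged sketch, not necessarily the PR's implementation: viewing `target` through `std::atomic<T>` assumes `std::atomic<T>` is lock-free and layout-compatible with `T` (true for `int` and `uint64_t` on common platforms, but formally an assumption).

```cpp
#include <atomic>

// Host-only CAS returning the previous value, mirroring CUDA's atomicCAS
// semantics: on success target becomes val; either way the old value of
// target is returned (compare_exchange_strong writes it into `compare`).
template <typename T>
T AtomicCAS(T &target, T compare, T val) {
  std::atomic<T> &tar = reinterpret_cast<std::atomic<T> &>(target);
  tar.compare_exchange_strong(compare, val, std::memory_order_acq_rel);
  return compare;
}
```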
  return ExecutionPolicy::Par;
}
-  return ExecutionPolicy::ParUnseq;
+  return ExecutionPolicy::Par;
Looks like another good constexpr candidate above.
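The constexpr suggestion might look like the following sketch: with the CUDA branch removed, a policy chooser can be a pure constexpr function of the problem size, so calls with compile-time-known sizes fold to a constant. The threshold value here is made up for illustration.

```cpp
#include <cstddef>

enum class ExecutionPolicy { Seq, Par };

// Pick a policy from problem size alone; constexpr lets the compiler
// evaluate this at compile time when size is a constant.
constexpr ExecutionPolicy autoPolicy(size_t size) {
  constexpr size_t kParallelThreshold = 4096;  // hypothetical cutoff
  return size <= kParallelThreshold ? ExecutionPolicy::Seq
                                    : ExecutionPolicy::Par;
}

static_assert(autoPolicy(16) == ExecutionPolicy::Seq,
              "small sizes resolve to sequential at compile time");
```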
}

#ifdef MANIFOLD_USE_CUDA
#define THRUST_DYNAMIC_BACKEND_VOID(NAME) \
💯
@@ -138,13 +95,12 @@ inline ExecutionPolicy autoPolicy(int size) {
  } \
}

-#if MANIFOLD_PAR == 'T' && !(__APPLE__)
+#if MANIFOLD_PAR == 'T' && __has_include(<pstl/glue_execution_defs.h>)
👍
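The diff above swaps a platform check for a feature test: rather than excluding Apple by name, it probes for the libstdc++ PSTL glue header directly, so any toolchain lacking it falls back to the sequential path. A hedged standalone sketch of that pattern (the macro name is made up; the real code would pass `std::execution::par` on the PSTL branch):

```cpp
#include <algorithm>
#include <vector>

// Feature-test the PSTL header instead of hard-coding platforms.
#if defined(__has_include) && __has_include(<pstl/glue_execution_defs.h>)
#define MANIFOLD_HAS_PSTL 1  // parallel std:: algorithms available
#else
#define MANIFOLD_HAS_PSTL 0  // sequential fallback
#endif

// This sketch keeps the sequential call on both branches so it builds
// everywhere; guarded code would use std::execution::par when available.
inline void sort_ints(std::vector<int> &v) {
  std::sort(v.begin(), v.end());
}
```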
  }
  std::cout << std::endl;
}
#endif

 private:
  VecDH<int64_t> data;
  inline int* ptr() { return reinterpret_cast<int32_t*>(data.ptrD()); }
  inline const int* ptr() const {
Nice job making these private! Love the readability improvements here.
@@ -255,42 +232,19 @@ class ManagedVec {
  if (bytes >= (1ull << 63)) {
    throw std::bad_alloc();
  }
-#ifdef MANIFOLD_USE_CUDA
-  if (CudaEnabled())
-    cudaMallocManaged(reinterpret_cast<void **>(ptr), bytes);
Oh, that's right - I forgot you already refactored VecDH for managed memory. Looks like you've taken care of everything I'd hoped for!
I also went ahead and implemented the vec_dh refactoring.
It seems that some thrust functions trigger GCC warnings; I'm not sure whether they are false positives. The code should be basically identical to the old code, and so far I haven't gotten any address sanitizer errors, so I am not too worried about them for now.
@elalish do you want to review this further, or should I just merge it?
Looks great, thanks!
* use single vector SparseIndices
* faster binary search
* tracy
* disable cuda and omp backends
* formatting
* find data races
* remove some cuda stuff
* rename indices to vertices
* fix
* better SparseIndices API
* minor fixes
* parallel boolean
* optimize csg_tree
* relax Samples.GyroidModule
* small fixes
* vec_dh refactor
* fix compiler errors
* disable some gcc warnings
* only apply new warning flags for GCC
There are various things in this PR:

* … VecDH for now. I can remove this commit if we don't really need this.
* Use a single vector for SparseIndices instead of two, and use a faster binary search adopted from https://github.com/scandum/binary_search/blob/master/binary_search.c#L264. I will post benchmark data later today.
* … src/third_party/thrust/thrust/system/tbb/detail/reduce.inl using an ignorelist, as it is known that thrust has data races in parallel reduce. I fixed some of the simpler ones by using atomics, but there are still two cases: UpdateProperties and CoplanarEdge. Note that by the C++ standard, a data race between threads without atomics or memory fences is undefined behavior.

Closes #524
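For reference, a sanitizer ignorelist entry of the kind described might look like the fragment below. This is a hedged sketch: the file name and glob are assumptions, and the file would be passed to Clang via `-fsanitize-ignorelist=` (older versions: `-fsanitize-blacklist=`).

```
# suppress sanitizer reports originating in thrust's TBB reduce,
# which is known to race (hypothetical ignorelist file, e.g. tsan.ignorelist)
src:*/third_party/thrust/thrust/system/tbb/detail/reduce.inl
```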