GitHub - AD2605/Reductions: Highly Performant Sum reduction in CUDA

A simple, device-callable, highly performant sum reduction implemented in CUDA at several levels: grid, cluster, block, warp, and thread.
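To show how these levels typically compose, here is a minimal sketch. It is not the code in `grid_reduction.hpp`. The names `warp_reduce_sum`, `block_reduce_sum`, `sum_kernel`, and `device_sum` are hypothetical, the launch shape of 64 blocks of 256 threads is an arbitrary assumption, and the cluster level is omitted. The grid-level combine uses a plain `atomicAdd`, which is the simplest option and may differ from what the repository does.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Warp level: each step folds the upper half of the active lanes onto the lower half.
__device__ float warp_reduce_sum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    return v;  // lane 0 holds the warp total
}

// Block level: one partial per warp goes through shared memory,
// then the first warp reduces those partials.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__device__ float block_reduce_sum(float v) {
    __shared__ float warp_sums[32];
    const int lane = threadIdx.x % warpSize;
    const int warp = threadIdx.x / warpSize;
    v = warp_reduce_sum(v);
    if (lane == 0) warp_sums[warp] = v;
    __syncthreads();
    const int num_warps = blockDim.x / warpSize;
    v = (threadIdx.x < num_warps) ? warp_sums[lane] : 0.0f;
    if (warp == 0) v = warp_reduce_sum(v);
    return v;  // thread 0 holds the block total
}

__global__ void sum_kernel(const float* in, float* out, size_t n) {
    // Thread level: each thread accumulates a strided slice of the input.
    float acc = 0.0f;
    const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        acc += in[i];
    }
    acc = block_reduce_sum(acc);
    // Grid level: one atomic per block combines the block totals.
    if (threadIdx.x == 0) atomicAdd(out, acc);
}

// Hypothetical host wrapper: copies the data over, launches, returns the sum.
float device_sum(const std::vector<float>& host) {
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, host.size() * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), host.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));
    sum_kernel<<<64, 256>>>(d_in, d_out, host.size());
    assert(cudaGetLastError() == cudaSuccess);
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling `device_sum` on a host vector returns the total. Because floating-point addition is not associative, the result can differ slightly from a sequential CPU sum for general inputs.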

Achieves ~95-97% of theoretical memory bandwidth across architectures. For now, it only works when the number of elements is a multiple of 4, though boundary checking can easily be added.
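One way boundary checking could be added is sketched below. It assumes the multiple-of-4 restriction comes from 128-bit `float4` loads, which the description does not state. The names `sum_vec4_with_tail` and `sum_any_length` are hypothetical, and the launch shape is arbitrary. The bulk of the input is read four floats at a time, and the 0 to 3 leftover elements are read as scalars.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumes `in` is 16-byte aligned (true for pointers returned by cudaMalloc)
// and blockDim.x is a multiple of 32.
__global__ void sum_vec4_with_tail(const float* in, float* out, size_t n) {
    const size_t tid = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    const size_t n_vec = n / 4;  // number of complete float4 groups
    const float4* in4 = reinterpret_cast<const float4*>(in);

    float acc = 0.0f;
    // Vectorized body: one 128-bit load per iteration.
    for (size_t i = tid; i < n_vec; i += stride) {
        const float4 v = in4[i];
        acc += (v.x + v.y) + (v.z + v.w);
    }
    // Boundary handling: scalar loads for the n % 4 trailing elements.
    for (size_t i = 4 * n_vec + tid; i < n; i += stride) {
        acc += in[i];
    }

    // Warp-level combine, then one atomic per warp.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        acc += __shfl_down_sync(0xffffffffu, acc, offset);
    }
    if (threadIdx.x % warpSize == 0) atomicAdd(out, acc);
}

// Hypothetical host wrapper for inputs of any length.
float sum_any_length(const std::vector<float>& host) {
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, host.size() * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), host.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));
    sum_vec4_with_tail<<<32, 256>>>(d_in, d_out, host.size());
    assert(cudaGetLastError() == cudaSuccess);
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The tail loop touches at most three elements, so it should have little effect on the achieved bandwidth for large inputs.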

CMakeLists.txt
LICENSE
README.md
defines.hpp
grid_reduction.hpp
reduction.cu
utils.hpp
