Obtaining Portable Performance

To obtain portable performance across the range of available devices, the various architectural differences need to be kept in mind:

Vector Operations

How should work be blocked to get the best memory bandwidth? Experiments indicate that the best choice on CPUs is one thread per work group, with as many work groups as there are cores. On GPUs, the vector types float4/double4 give much better performance on AMD devices. Intel MIC still needs further investigation. Open checks:

Does the best configuration change notably with problem size?
Are there differences between the AMD SDK, the NVIDIA SDK, and the Intel SDK?
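
The CPU rule above (one thread per work group, one work group per core) can be sketched as follows. This is a minimal illustration, not code from the library: `LaunchConfig`, `cpu_launch_config`, and `block_range` are hypothetical names, and giving each work group one contiguous block of the vector is an assumption about how the blocking might be done.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical container for an OpenCL-style launch configuration.
struct LaunchConfig {
  std::size_t local_size;  // threads per work group
  std::size_t num_groups;  // number of work groups
};

// CPU rule from the text: one thread per work group,
// and as many work groups as there are cores.
LaunchConfig cpu_launch_config(std::size_t num_cores) {
  return LaunchConfig{1, num_cores == 0 ? 1 : num_cores};
}

// Assumed blocking scheme: split the index range [0, n) into one
// contiguous block per work group. Returns the half-open range
// [begin, end) handled by work group `group_id`.
std::pair<std::size_t, std::size_t>
block_range(std::size_t n, std::size_t group_id, std::size_t num_groups) {
  assert(num_groups > 0 && group_id < num_groups);
  std::size_t chunk = (n + num_groups - 1) / num_groups;  // ceil(n / num_groups)
  std::size_t begin = std::min(n, group_id * chunk);
  std::size_t end   = std::min(n, begin + chunk);
  return {begin, end};
}
```

With this scheme, trailing work groups may receive a shorter or empty block when the vector length is not a multiple of the number of groups, so a kernel using it would need to loop over `[begin, end)` rather than assume a fixed block length.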

Matrix Operations

Matrices generally require a certain amount of padding in order to achieve good performance. For BLAS Level 2, the workhorse is the sparse matrix-vector multiplication, which builds on the reductions used for inner products. More tuning experience on the various architectures is still needed in order to pick good defaults.
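
As a rough illustration of the padding mentioned above, the sketch below rounds a matrix dimension up to a multiple of a block size and computes element offsets using the padded row length. The names `padded_size` and `padded_index`, the row-major layout, and the block size of 16 used in the usage note are assumptions made for this example; the text does not state which padding the library actually uses.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of `multiple` (n itself if already aligned).
std::size_t padded_size(std::size_t n, std::size_t multiple) {
  assert(multiple > 0);
  return ((n + multiple - 1) / multiple) * multiple;
}

// Offset of element (row, col) in a row-major buffer whose rows are
// stored with `padded_cols` entries each (assumed layout).
std::size_t padded_index(std::size_t row, std::size_t col,
                         std::size_t padded_cols) {
  assert(col < padded_cols);
  return row * padded_cols + col;
}
```

For example, a matrix with 100 columns padded to a multiple of 16 would be stored with 112 entries per row, so every row starts at an offset that is a multiple of the block size.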

Matrix-matrix multiplications have their own set of tricks. A good tuning facility already exists for them, and good default parameters are already available.
