h100 timings for CCO‐Surface30

Eric Bylaska edited this page May 22, 2024 · 7 revisions

5/21/2024 - CCO-surface30 Benchmark

  • This example is FFT dominant.

Date: 5/21/2024

Directory: /home/bylaska/PWDFT3/QA/CCO-Cu_surface30

The table below reports performance timings for this benchmark on the h100 machine with varying numbers of CPU cores (ncpus). All timings are in seconds. The cputime column is the total, and the remaining columns break it down into components:

  • non-local: Timings for non-local operations.
  • ffm: Timings for ffm operations.
  • fmf: Timings for fmf operations.
  • fft: Timings for FFT (Fast Fourier Transform) operations.
  • diagonalize: Timings for diagonalization operations.

In the CUDA binary, FFT operations are performed exclusively on the GPU, and BLAS3 operations are also executed on the GPU. Note that the GPUs become overloaded once ncpus reaches a threshold of 6.

| machine | ngpus | cputime | non-local | ffm | fmf | fft | diagonalize |
|---|---|---|---|---|---|---|---|
| h100 | 8 | 5.741e+00 | 6.280e-01 | 5.799e-02 | 5.322e-02 | 4.843e+00 | |
| h100 | 8ns | 9.297e+00 | 6.639e-01 | 5.412e-02 | 6.039e-02 | 8.371e+00 | |
| h100 | 16 | 4.728e+00 | 3.345e-01 | 3.540e-02 | 2.810e-02 | 4.247e+00 | |
| h100 | 16ns | 8.007e+00 | 3.308e-01 | 3.972e-02 | 2.685e-02 | 7.965e+00 | |
| h100 | 24 | 4.465e+00 | 2.349e-01 | 3.258e-02 | 1.446e-02 | 4.136e+00 | |
| h100 | 24ns | 8.148e+00 | 2.338e-01 | 3.213e-02 | 1.766e-02 | 7.816e+00 | |
| h100 | 32 | 4.330e+00 | 1.805e-01 | 2.644e-02 | 1.115e-02 | 4.068e+00 | |
| h100 | 32ns | 7.805e+00 | 1.813e-01 | 2.720e-02 | 1.283e-02 | 7.747e+00 | |
| h100 | 40 | 4.272e+00 | 1.510e-01 | 3.136e-02 | 8.825e-03 | 4.042e+00 | |
| h100 | 48 | 4.212e+00 | 1.261e-01 | 2.370e-02 | 7.642e-03 | 4.022e+00 | |
| h100 | 64 | 4.150e+00 | 1.022e-01 | 2.473e-02 | 5.407e-03 | 3.992e+00 | |
| h100 | 64ns | 7.805e+00 | 1.035e-01 | 2.790e-02 | 5.609e-03 | 7.649e+00 | |
| h100 | 80 | 4.152e+00 | 9.994e-02 | 2.903e-02 | 4.021e-03 | 3.996e+00 | |
| h100 | 96 | 4.098e+00 | 7.455e-02 | 2.655e-02 | 3.684e-03 | 3.973e+00 | |
| h100 | 128 | 4.230e+00 | 8.085e-02 | 3.030e-02 | 2.749e-03 | 4.102e+00 | |
| h100 | 128ns | 7.899e+00 | 7.598e-02 | 3.137e-02 | 2.894e-03 | 7.784e+00 | |
| h100 | 160 | 4.209e+00 | 6.614e-02 | 3.238e-02 | 2.234e-03 | 4.096e+00 | |
| h100 | 200 | | | | | | |

The table presents the total and component times for different numbers of CPU cores (ncpus). The optimal timings for each component are indicated by bold values.
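As a rough aid for reading the table, the following Python sketch computes the share of total time spent in the FFT column and the speedup relative to the smallest run. It rests on an assumption that the page does not confirm: each row lists only five values, so they are taken to fill the first five timing columns in header order (cputime, non-local, ffm, fmf, fft). The rows with the "ns" suffix are left out because the suffix is not defined on this page. The names `ROWS`, `fft_share`, `speedup`, and `fastest` are hypothetical helpers, not part of PWDFT.

```python
# Timings in seconds, copied from the table for the rows without the "ns" suffix.
# ASSUMPTION: the five values per row map to the first five timing columns
# in header order: cputime, non-local, ffm, fmf, fft.
ROWS = {
    "8":   (5.741e+00, 6.280e-01, 5.799e-02, 5.322e-02, 4.843e+00),
    "16":  (4.728e+00, 3.345e-01, 3.540e-02, 2.810e-02, 4.247e+00),
    "24":  (4.465e+00, 2.349e-01, 3.258e-02, 1.446e-02, 4.136e+00),
    "32":  (4.330e+00, 1.805e-01, 2.644e-02, 1.115e-02, 4.068e+00),
    "40":  (4.272e+00, 1.510e-01, 3.136e-02, 8.825e-03, 4.042e+00),
    "48":  (4.212e+00, 1.261e-01, 2.370e-02, 7.642e-03, 4.022e+00),
    "64":  (4.150e+00, 1.022e-01, 2.473e-02, 5.407e-03, 3.992e+00),
    "80":  (4.152e+00, 9.994e-02, 2.903e-02, 4.021e-03, 3.996e+00),
    "96":  (4.098e+00, 7.455e-02, 2.655e-02, 3.684e-03, 3.973e+00),
    "128": (4.230e+00, 8.085e-02, 3.030e-02, 2.749e-03, 4.102e+00),
    "160": (4.209e+00, 6.614e-02, 3.238e-02, 2.234e-03, 4.096e+00),
}


def fft_share(label):
    """Fraction of cputime spent in the (assumed) fft column for one row."""
    row = ROWS[label]
    return row[4] / row[0]


def speedup(label, baseline="8"):
    """Ratio of the baseline row's cputime to this row's cputime."""
    return ROWS[baseline][0] / ROWS[label][0]


def fastest():
    """Label of the row with the smallest cputime."""
    return min(ROWS, key=lambda label: ROWS[label][0])


if __name__ == "__main__":
    for label in ROWS:
        print(f"{label:>4}  fft share = {fft_share(label):.3f}  "
              f"speedup vs 8 = {speedup(label):.3f}")
    print("fastest row:", fastest())
```

Under that column assumption, the FFT column accounts for most of the total time in every row, which is consistent with the note that this example is FFT dominant.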