Parallel Computations on GPU (PCG 2020)
Assignment no. 1 (CUDA)
Login: xstupi00
Step 0: basic implementation (STEPS=500, THREADS_PER_BLOCK=1024)
=============================
Data size        Time [s]
1 * 1024 0.815792
2 * 1024 1.61373
3 * 1024 2.40606
4 * 1024 3.21471
5 * 1024 4.00240
6 * 1024 4.79797
7 * 1024 5.60371
8 * 1024 6.39995
9 * 1024 7.25784
10 * 1024 7.99933
11 * 1024 8.80105
12 * 1024 9.59180
13 * 1024 10.4035
14 * 1024 21.8723
15 * 1024 23.4524
16 * 1024 25.1347
17 * 1024 26.7706
18 * 1024 28.2069
19 * 1024 30.0450
20 * 1024 31.6136
21 * 1024 33.3124
22 * 1024 34.7625
23 * 1024 36.6033
24 * 1024 38.2630
25 * 1024 39.9650
26 * 1024 41.5480
27 * 1024 63.2957
28 * 1024 65.6216
29 * 1024 68.0706
30 * 1024 70.5210
Did any anomaly occur in the data?
If so, explain:
The kernel uses 36 registers per thread, i.e. 36864 registers per block of 1024 threads. The "Tesla K20m" provides
65536 registers per SM, so with 36864 registers required per block each SM can execute only 1 block (32 warps) at a
time. Each block contains 1024 threads and therefore processes 1024 particles at once. Since the "Tesla K20m" has
13 SMs, the maximum number of particles processed at once is limited to 13 * 1024:
13 SMs, 1 block per SM, 1024 threads per block
13 SMs * 1 block per SM = 13 resident blocks
13 resident blocks * 1024 threads per block = 13 * 1024 threads
Given these facts, the performance anomaly appears as soon as the number of input particles exceeds this limit. In
that case the "Tesla K20m" cannot process all the particles at once and some SMs have to run additional blocks.
Simply put, above the limit the particles are processed in two rounds, while below it one round was enough. The same
reason causes the second anomaly above 26 * 1024 particles (2 * 13 * 1024), where even two rounds of processing are
no longer sufficient and some SMs perform up to three rounds.
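This limit can also be cross-checked programmatically with the CUDA occupancy API. The following is only a minimal
sketch: the dummy kernel below stands in for the assignment's real kernel (passing the real kernel, with its
36 registers per thread, is what would actually report the 1 block/SM limit).

    // Minimal occupancy cross-check (sketch). The dummy kernel uses far fewer
    // registers than the real one; it only demonstrates the API call.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(float *p) { if (p) p[threadIdx.x] = 0.0f; }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blocksPerSM = 0;
        // 1024 threads per block, no dynamic shared memory (assumption)
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy_kernel, 1024, 0);

        printf("SMs: %d, resident blocks/SM: %d, concurrent threads: %d\n",
               prop.multiProcessorCount, blocksPerSM,
               prop.multiProcessorCount * blocksPerSM * 1024);
        return 0;
    }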
Step 1: code optimization
=====================
Was a speedup achieved?
Yes, a performance improvement was observed. The results for the specific input sizes are available in the file
times.txt. On average, a speedup of about 32.50% compared to the previous step (Step0) was recorded.
Describe the two main reasons:
The greatest impact on the performance of the computation comes from the reduction in the number of global load
transactions. In Step0, duplicated accesses to global memory were performed when loading the data required by the
individual kernels. In particular, the kernel calculate_collision_velocity repeatedly loads from global memory the
same data that the kernel calculate_gravitation_velocity had already loaded before it. Reducing the number of
accesses to global memory therefore significantly improves the overall performance of the computation.
The second greatest impact comes from the reduction in the number of FP operations. Eliminating the duplicate
calculations, primarily in the kernel calculate_collision_velocity, significantly reduced the number of FP
operations in the whole program. Optimizing the expressions that describe the relations between particles also
contributed to this reduction.
A less significant effect on the performance improvement comes from the lower overhead of kernel invocations. In
the first step (Step0), three kernels were invoked one after another, which carried a certain overhead. Now only
one computation kernel is invoked, which implements the whole computation logic, and therefore most of this
overhead has been eliminated.
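As an illustration of this fused-kernel idea (a sketch only, not the assignment's actual code: the Particles layout,
the constants and the collision formula below are assumptions), the single kernel might look roughly as follows:

    // Sketch of a fused velocity kernel: the j-th particle is loaded from
    // global memory once per iteration and both the gravitational and the
    // collision contribution reuse the same distance computation.
    #define COLLISION_DISTANCE 0.01f      // assumed constant
    #define GRAV_CONST         6.674e-11f // assumed constant

    struct Particles {                    // assumed structure-of-arrays layout
        float *pos_x, *pos_y, *pos_z;
        float *vel_x, *vel_y, *vel_z;
        float *weight;
    };

    __global__ void calculate_velocity_sketch(Particles p, Particles p_next,
                                              int N, float dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;

        float pos_x = p.pos_x[i], pos_y = p.pos_y[i], pos_z = p.pos_z[i];
        float vel_x = p.vel_x[i], vel_y = p.vel_y[i], vel_z = p.vel_z[i];
        float w1 = p.weight[i];
        float tmp_x = 0.0f, tmp_y = 0.0f, tmp_z = 0.0f;

        for (int j = 0; j < N; j++) {
            // single load of the j-th particle, shared by both contributions
            float dx = p.pos_x[j] - pos_x;
            float dy = p.pos_y[j] - pos_y;
            float dz = p.pos_z[j] - pos_z;
            float w2 = p.weight[j];
            float r  = sqrtf(dx * dx + dy * dy + dz * dz);

            if (r > COLLISION_DISTANCE) {
                // gravitational velocity increment, reusing the distance
                float f = GRAV_CONST * w2 / (r * r * r) * dt;
                tmp_x += dx * f;
                tmp_y += dy * f;
                tmp_z += dz * f;
            } else if (r > 0.0f) {
                // elastic collision applied per component (assumed formula)
                float inv = 1.0f / (w1 + w2);
                tmp_x += ((w1 - w2) * vel_x + 2.0f * w2 * p.vel_x[j]) * inv - vel_x;
                tmp_y += ((w1 - w2) * vel_y + 2.0f * w2 * p.vel_y[j]) * inv - vel_y;
                tmp_z += ((w1 - w2) * vel_z + 2.0f * w2 * p.vel_z[j]) * inv - vel_z;
            }
        }

        // final velocity and position update (previously in update_particles)
        vel_x += tmp_x; vel_y += tmp_y; vel_z += tmp_z;
        p_next.vel_x[i] = vel_x;  p_next.pos_x[i] = pos_x + vel_x * dt;
        p_next.vel_y[i] = vel_y;  p_next.pos_y[i] = pos_y + vel_y * dt;
        p_next.vel_z[i] = vel_z;  p_next.pos_z[i] = pos_z + vel_z * dt;
    }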
Compare the metrics with the previous step:
Profiling command: ./nbody 30720 0.01f 1 1024 0 4096 128 ../sampledata/30720.h5 stepBOutput.h5
The individual metrics of the kernel calculate_velocity are almost the same as those of the kernel
calculate_gravitation_velocity in Step0. This is because that kernel already contained the major part of the
computation in Step0. On the other hand, the kernel calculate_collision_velocity contained only a minor part of the
whole computation, and its major part consisted of unnecessary duplicate calculations and repeated loads from global
memory. In this step, the kernel calculate_velocity is extended only by the calculation of the collision velocities,
which reuses already computed partial results, and by the final calculation of the new velocities and positions
taken over from the kernel update_particles.
Global Load Transactions (gld_transactions): 206 462 400 (Step0) - 117 971 520 (Step1) => 88 490 880
 -> As stated above, combining the three kernels into a single kernel reduces the number of transactions from
    global memory. The main reason is that in Step0 the kernel calculate_collision_velocity performed repeated
    loads from global memory (loading the particles). In this step we also avoided the temporary velocity vector,
    which was likewise accessed in global memory.
 -> We can observe that the number of global load transactions in this step is only slightly larger than the number
    of such transactions in the kernel calculate_gravitation_velocity in the previous step:
    -- Step0 (calculate_gravitation_velocity) 117 967 680 => Step1 (calculate_velocity) 117 971 520
    The overall number of these transactions decreased by roughly the number performed by the kernel
    calculate_collision_velocity in Step0, which confirms our claim.
    -- Step0 (calculate_collision_velocity) 88 483 200 => Step0 - Step1 difference 88 490 880
    We can see that this kernel indeed performed unnecessary repeated transactions from global memory.
Floating Point Operations Single Precision (flop_count_sp): 5.2848e+10 (Step0) - 3.8693e+10 (Step1) => 1.4155e+10
 -> We can see that combining the individual kernels into one kernel reduced the number of operations. In Step0,
    some calculations had to be repeated because the computation of the gravitation and collision velocities was
    split across several kernels, for example:
    - the computation of the distance between the relevant particles
    - the computation of the inverse of the distance between particles
    - the addition of the partially computed velocities to the auxiliary velocity vector
    Note that the number of eliminated operations is almost equal to the number of operations performed in Step0
    by the kernel calculate_collision_velocity (1.4156e+10). That kernel performed most of the duplicate
    calculations, which had already been computed by the kernel calculate_gravitation_velocity before it. This is
    also confirmed by the number of operations in this step (3.8693e+10), which is almost identical to the number
    of operations performed in the previous step by the kernel calculate_gravitation_velocity (3.8692e+10). The
    slight difference is due to the added operations from the kernel update_particles (276480).
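The metric names quoted above correspond to nvprof hardware counters; a command of roughly this form (illustrative,
not taken from the assignment scripts) collects them for the profiling run listed above:

    nvprof --metrics gld_transactions,flop_count_sp \
           ./nbody 30720 0.01f 1 1024 0 4096 128 ../sampledata/30720.h5 stepBOutput.h5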
Step 2: shared memory
=====================
Was a speedup achieved?
Yes, a performance improvement was observed. The results for the specific input sizes are available in the file
times.txt. On average, a speedup of about 28.25% compared to the previous step (Step1) was recorded.
Justify:
The performance improvement was primarily achieved by a significant reduction in the number of accesses to global
memory. These accesses were replaced by accesses to shared memory, which is significantly faster than global memory.
Threads access global memory only when loading the relevant data into shared memory. Furthermore, the principal
requirement for global memory accesses is now met: neighbouring threads access adjacent locations. This requirement
was not met in the previous step (Step1), since the threads in the computation loop all access the data of the same
particle (the j-th particle). Thanks to the frequent loads from shared memory, the individual SMs did not have to
wait for data, which reduced the idle time in the calculation.
In summary, the primary reasons that affected the performance are (a sketch of the tiling pattern follows the list):
- a decrease in the number of global load transactions
- an increase in the number of shared load transactions
- a decrease in stalls caused by memory dependencies
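A minimal sketch of the tiling pattern described above, reusing the assumed Particles layout and constants from the
Step 1 sketch (block-size handling and names are again assumptions, not the assignment code):

    // Each block stages one tile of particles into shared memory with
    // coalesced loads (thread t loads element tileStart + t); all pairwise
    // interactions against that tile are then served from shared memory.
    // Launch with: kernel<<<grid, block, 4 * block * sizeof(float)>>>(...)
    __global__ void calculate_velocity_shared_sketch(Particles p, Particles p_next,
                                                     int N, float dt)
    {
        extern __shared__ float tile[];            // pos_x | pos_y | pos_z | weight
        float *s_pos_x  = tile;
        float *s_pos_y  = &tile[blockDim.x];
        float *s_pos_z  = &tile[2 * blockDim.x];
        float *s_weight = &tile[3 * blockDim.x];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float pos_x = (i < N) ? p.pos_x[i] : 0.0f;
        float pos_y = (i < N) ? p.pos_y[i] : 0.0f;
        float pos_z = (i < N) ? p.pos_z[i] : 0.0f;
        float tmp_x = 0.0f, tmp_y = 0.0f, tmp_z = 0.0f;

        for (int tileStart = 0; tileStart < N; tileStart += blockDim.x) {
            int j = tileStart + threadIdx.x;
            // the only global loads: one coalesced element per thread
            s_pos_x[threadIdx.x]  = (j < N) ? p.pos_x[j]  : 0.0f;
            s_pos_y[threadIdx.x]  = (j < N) ? p.pos_y[j]  : 0.0f;
            s_pos_z[threadIdx.x]  = (j < N) ? p.pos_z[j]  : 0.0f;
            s_weight[threadIdx.x] = (j < N) ? p.weight[j] : 0.0f;
            __syncthreads();

            for (int k = 0; k < blockDim.x && tileStart + k < N; k++) {
                float dx = s_pos_x[k] - pos_x;
                float dy = s_pos_y[k] - pos_y;
                float dz = s_pos_z[k] - pos_z;
                float r  = sqrtf(dx * dx + dy * dy + dz * dz);
                if (r > COLLISION_DISTANCE) {
                    float f = GRAV_CONST * s_weight[k] / (r * r * r) * dt;
                    tmp_x += dx * f; tmp_y += dy * f; tmp_z += dz * f;
                }
                // (collision branch omitted for brevity; it would also read
                //  velocities staged in shared memory)
            }
            __syncthreads();
        }

        if (i < N) {
            float vel_x = p.vel_x[i] + tmp_x;
            float vel_y = p.vel_y[i] + tmp_y;
            float vel_z = p.vel_z[i] + tmp_z;
            p_next.vel_x[i] = vel_x;  p_next.pos_x[i] = pos_x + vel_x * dt;
            p_next.vel_y[i] = vel_y;  p_next.pos_y[i] = pos_y + vel_y * dt;
            p_next.vel_z[i] = vel_z;  p_next.pos_z[i] = pos_z + vel_z * dt;
        }
    }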
Compare the metrics with the previous step:
Profiling command: ./nbody 30720 0.01f 1 1024 0 4096 128 ../sampledata/30720.h5 stepBOutput.h5
The metrics in this step are significantly affected by the use of shared memory, in comparison to the previous step (Step1).
Shared Load Transactions (shared_load_transactions): 0 (Step1) => 117 964 800 (Step2)
 -> Since shared memory was not used in Step1, the number of shared load transactions was zero. In this step, with
    the use of shared memory, this metric increased to 117 964 800. An interesting observation is that the number
    of shared load transactions is almost equal to the number of global load transactions in the previous step,
    where shared memory was not used:
    -- Global Load Transactions (Step1) = 117 971 520 => Shared Load Transactions (Step2) = 117 964 800
    Thus we can see that in this step the loads from global memory were almost completely replaced by loads from
    shared memory.
Issue Stall Reasons (Data Request) (stall_memory_dependency): 12.76% (Step1) => 0.01% (Step2)
 -> In Step1, data was loaded only from global memory, which takes many clock cycles (around 440 clocks). Under
    these circumstances the SMs may have to wait for data, which causes idle time in the computation, and the
    efficiency of the SMs is then low.
 -> In this step (Step2), loads from global memory were restricted and the threads load primarily from shared
    memory, which takes only about 48 clocks. As a result, the SMs do not have to wait for data, there is only
    minimal idle time, and it is very likely that the efficiency of the SMs increases.
Global Load Transactions (gld_transactions): 117 971 520 (Step1) => 216 960 (Step2)
 -> We drastically decreased the number of global load transactions by using shared memory. In the previous step,
    we accessed global memory very often and repeatedly inside the loop in order to load the required particle
    data. Now, we access global memory only when loading the required data into shared memory, and subsequently we
    work only with this loaded data.
Global Memory Load Efficiency (gld_efficiency): 12.52% (Step1) => 100.00% (Step2)
 -> In Step1, the individual threads within a block all accessed the j-th particle stored in global memory in the
    computation loop. Such accesses were therefore not coalesced and the basic requirement for memory access was
    not satisfied: neighbouring threads did not access adjacent locations, but a common location instead.
 -> In Step2, this shortcoming has been remedied by using shared memory. The individual threads within a block
    access global memory only when loading the relevant particle into shared memory. Since each thread in a block
    loads its own element depending on its thread index, the requirement is met: neighbouring threads access
    adjacent locations, and therefore we achieved a significant increase in this metric.
Requested Global Load Throughput (gld_requested_throughput): 4.6226GB/s (Step1) => 387.20MB/s (Step2)
 -> In the previous step (Step1), the threads accessed global memory very frequently in the inner loop of the
    computation in order to obtain the data needed for each iteration. For this reason the requested global load
    throughput was very high. In comparison, in this step the threads access global memory only when loading the
    relevant data into shared memory. Therefore a significant decrease in the requested global load throughput was
    observed, since the frequency of accesses to global memory is also significantly lower.
Step 5: performance analysis (STEPS=500)
======================
N         CPU time [s]   GPU time [s]   memory throughput [MB/s]   performance [MFLOPS]   speedup [-]   thr_blc
-----------------------------------------------------------------------------------------------------------------
128       0.463157       0.085749       238.0742                    4311.7470              5.4013        32
256       1.82851        0.123418       185.5281                    11559.2863             14.8155       256
512       7.29554        0.217018       251.3829                    26057.0459             33.6172       512
1024      29.1873        0.402592       380.5525                    56050.3934             72.4984       1024
2048      116.772        0.777541       677.9731                    116010.8598            150.1811      1024
4096      467.235        1.535198       1241.9881                   234982.7344            304.3484      1024
8192      1869.88        3.053371       2330.9791                   472557.6708            612.3985      1024
16384     ~7479.52       6.639555       5893.8385                   869260.2224            ~1126.5092    256
32768     ~29918.08      29.237558      8216.5539                   789595.9223            ~1023.2756    512
65536     ~119672.32     117.196181     14193.225                   787911.1722            ~1021.1281    512
131072    ~478689.28     411.074028     24770.7922                  ~788334.1568           ~1164.4843    512
From what number of particles does it pay off to compute on the graphics card?
 -> Assuming that the optimized parallel version should be at least 10x faster to be worthwhile, it is more
    efficient to compute the simulation on the GPU from N=256 upwards, where a roughly 14x speedup compared to the
    simple CPU version was achieved.
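    For example, the table above gives 1.82851 s on the CPU versus 0.123418 s on the GPU at N=256, i.e.
    1.82851 / 0.123418 ≈ 14.8x, whereas at N=128 the ratio is only 0.463157 / 0.085749 ≈ 5.4x, which is below the
    assumed 10x threshold.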
===================================