Parallel Computations on GPU (PCG 2020)
Assignment no. 1 (CUDA)
Login: xstupi00
Step 0: basic implementation (STEPS=500, THREADS_PER_BLOCK=1024)
=============================
Data size        Time [s]
1 * 1024 0.815792
2 * 1024 1.61373
3 * 1024 2.40606
4 * 1024 3.21471
5 * 1024 4.00240
6 * 1024 4.79797
7 * 1024 5.60371
8 * 1024 6.39995
9 * 1024 7.25784
10 * 1024 7.99933
11 * 1024 8.80105
12 * 1024 9.59180
13 * 1024 10.4035
14 * 1024 21.8723
15 * 1024 23.4524
16 * 1024 25.1347
17 * 1024 26.7706
18 * 1024 28.2069
19 * 1024 30.0450
20 * 1024 31.6136
21 * 1024 33.3124
22 * 1024 34.7625
23 * 1024 36.6033
24 * 1024 38.2630
25 * 1024 39.9650
26 * 1024 41.5480
27 * 1024 63.2957
28 * 1024 65.6216
29 * 1024 68.0706
30 * 1024 70.5210
Did any anomaly occur in the data?
If so, explain:
The kernel uses 36 registers per thread, i.e. 36864 registers per block of 1024 threads. The "Tesla K20m" provides
65536 registers per SM, so with 36864 registers required per block each SM can execute only 1 block (32 warps) at a
time. Each block contains 1024 threads and therefore processes 1024 particles at once. Since the "Tesla K20m" has
13 SMs, the maximum number of particles processed at once is limited to 13 * 1024:
13 SMs, 1 block per SM, 1024 threads per block
13 SMs * 1 block per SM = 13 resident blocks
13 resident blocks * 1024 threads per block = 13 * 1024 threads
Given these facts, the performance anomaly appears as soon as the number of input particles exceeds this limit. In
that case the "Tesla K20m" cannot process all the particles at once and some SMs have to run additional blocks.
Simply put, above the limit the particles are processed in two rounds, while below it one round was enough. The same
reason causes the second anomaly above 26 * 1024 particles (2 * 13 * 1024), where even two rounds of processing are
no longer sufficient and some SMs perform up to three rounds.
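This limit can also be cross-checked programmatically with the CUDA occupancy API. The following is only a minimal
sketch: the dummy kernel below stands in for the assignment's real kernel (passing the real kernel, with its
36 registers per thread, is what would actually report the 1 block/SM limit).

    // Minimal occupancy cross-check (sketch). The dummy kernel uses far fewer
    // registers than the real one; it only demonstrates the API call.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(float *p) { if (p) p[threadIdx.x] = 0.0f; }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blocksPerSM = 0;
        // 1024 threads per block, no dynamic shared memory (assumption)
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy_kernel, 1024, 0);

        printf("SMs: %d, resident blocks/SM: %d, concurrent threads: %d\n",
               prop.multiProcessorCount, blocksPerSM,
               prop.multiProcessorCount * blocksPerSM * 1024);
        return 0;
    }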
Step 1: code optimization
=====================
Was a speedup achieved?
Yes, a performance improvement was observed. The results for the specific input sizes are available in the file
times.txt. On average, a speedup of about 32.50% compared to the previous step (Step0) was recorded.
Describe the two main reasons:
The greatest impact on the performance of the computation comes from the reduction in the number of global load
transactions. In Step0, duplicated accesses to global memory were performed when loading the data required by the
individual kernels. In particular, the kernel calculate_collision_velocity repeatedly loads from global memory the
same data that the kernel calculate_gravitation_velocity had already loaded before it. Reducing the number of
accesses to global memory therefore significantly improves the overall performance of the computation.
The second greatest impact comes from the reduction in the number of FP operations. Eliminating the duplicate
calculations, primarily in the kernel calculate_collision_velocity, significantly reduced the number of FP
operations in the whole program. Optimizing the expressions that describe the relations between particles also
contributed to this reduction.
A less significant effect on the performance improvement comes from the lower overhead of kernel invocations. In
the first step (Step0), three kernels were invoked one after another, which carried a certain overhead. Now only
one computation kernel is invoked, which implements the whole computation logic, and therefore most of this
overhead has been eliminated.
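As an illustration of this fused-kernel idea (a sketch only, not the assignment's actual code: the Particles layout,
the constants and the collision formula below are assumptions), the single kernel might look roughly as follows:

    // Sketch of a fused velocity kernel: the j-th particle is loaded from
    // global memory once per iteration and both the gravitational and the
    // collision contribution reuse the same distance computation.
    #define COLLISION_DISTANCE 0.01f      // assumed constant
    #define GRAV_CONST         6.674e-11f // assumed constant

    struct Particles {                    // assumed structure-of-arrays layout
        float *pos_x, *pos_y, *pos_z;
        float *vel_x, *vel_y, *vel_z;
        float *weight;
    };

    __global__ void calculate_velocity_sketch(Particles p, Particles p_next,
                                              int N, float dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;

        float pos_x = p.pos_x[i], pos_y = p.pos_y[i], pos_z = p.pos_z[i];
        float vel_x = p.vel_x[i], vel_y = p.vel_y[i], vel_z = p.vel_z[i];
        float w1 = p.weight[i];
        float tmp_x = 0.0f, tmp_y = 0.0f, tmp_z = 0.0f;

        for (int j = 0; j < N; j++) {
            // single load of the j-th particle, shared by both contributions
            float dx = p.pos_x[j] - pos_x;
            float dy = p.pos_y[j] - pos_y;
            float dz = p.pos_z[j] - pos_z;
            float w2 = p.weight[j];
            float r  = sqrtf(dx * dx + dy * dy + dz * dz);

            if (r > COLLISION_DISTANCE) {
                // gravitational velocity increment, reusing the distance
                float f = GRAV_CONST * w2 / (r * r * r) * dt;
                tmp_x += dx * f;
                tmp_y += dy * f;
                tmp_z += dz * f;
            } else if (r > 0.0f) {
                // elastic collision applied per component (assumed formula)
                float inv = 1.0f / (w1 + w2);
                tmp_x += ((w1 - w2) * vel_x + 2.0f * w2 * p.vel_x[j]) * inv - vel_x;
                tmp_y += ((w1 - w2) * vel_y + 2.0f * w2 * p.vel_y[j]) * inv - vel_y;
                tmp_z += ((w1 - w2) * vel_z + 2.0f * w2 * p.vel_z[j]) * inv - vel_z;
            }
        }

        // final velocity and position update (previously in update_particles)
        vel_x += tmp_x; vel_y += tmp_y; vel_z += tmp_z;
        p_next.vel_x[i] = vel_x;  p_next.pos_x[i] = pos_x + vel_x * dt;
        p_next.vel_y[i] = vel_y;  p_next.pos_y[i] = pos_y + vel_y * dt;
        p_next.vel_z[i] = vel_z;  p_next.pos_z[i] = pos_z + vel_z * dt;
    }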
Compare the metrics with the previous step:
Profiling command: ./nbody 30720 0.01f 1 1024 0 4096 128 ../sampledata/30720.h5 stepBOutput.h5
The individual metrics of the kernel calculate_velocity are almost the same as those of the kernel
calculate_gravitation_velocity in Step0. This is because that kernel already contained the major part of the
computation in Step0. On the other hand, the kernel calculate_collision_velocity contained only a minor part of the
whole computation, and its major part consisted of unnecessary duplicate calculations and repeated loads from global
memory. In this step, the kernel calculate_velocity is extended only by the calculation of the collision velocities,
which reuses already computed partial results, and by the final calculation of the new velocities and positions
taken over from the kernel update_particles.
Global Load Transactions (gld_transactions): 206 462 400 (Step0) - 117 971 520 (Step1) => 88 490 880
 -> As stated above, combining the three kernels into a single kernel reduces the number of transactions from
    global memory. The main reason is that in Step0 the kernel calculate_collision_velocity performed repeated
    loads from global memory (loading the particles). In this step we also avoided the temporary velocity vector,
    which was likewise accessed in global memory.
 -> We can observe that the number of global load transactions in this step is only slightly larger than the number
    of such transactions in the kernel calculate_gravitation_velocity in the previous step:
    -- Step0 (calculate_gravitation_velocity) 117 967 680 => Step1 (calculate_velocity) 117 971 520
    The overall number of these transactions decreased by roughly the number performed by the kernel
    calculate_collision_velocity in Step0, which confirms our claim.
    -- Step0 (calculate_collision_velocity) 88 483 200 => Step0 - Step1 difference 88 490 880
    We can see that this kernel indeed performed unnecessary repeated transactions from global memory.
Floating Point Operations Single Precision (flop_count_sp): 5.2848e+10 (Step0) - 3.8693e+10 (Step1) => 1.4155e+10
 -> We can see that combining the individual kernels into one kernel reduced the number of operations. In Step0,
    some calculations had to be repeated because the computation of the gravitation and collision velocities was
    split across several kernels, for example:
    - the computation of the distance between the relevant particles
    - the computation of the inverse of the distance between particles
    - the addition of the partially computed velocities to the auxiliary velocity vector
    Note that the number of eliminated operations is almost equal to the number of operations performed in Step0
    by the kernel calculate_collision_velocity (1.4156e+10). That kernel performed most of the duplicate
    calculations, which had already been computed by the kernel calculate_gravitation_velocity before it. This is
    also confirmed by the number of operations in this step (3.8693e+10), which is almost identical to the number
    of operations performed in the previous step by the kernel calculate_gravitation_velocity (3.8692e+10). The
    slight difference is due to the added operations from the kernel update_particles (276480).
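The metric names quoted above correspond to nvprof hardware counters; a command of roughly this form (illustrative,
not taken from the assignment scripts) collects them for the profiling run listed above:

    nvprof --metrics gld_transactions,flop_count_sp \
           ./nbody 30720 0.01f 1 1024 0 4096 128 ../sampledata/30720.h5 stepBOutput.h5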
Step 2: shared memory
=====================
Was a speedup achieved?
Yes, a performance improvement was observed. The results for the specific input sizes are available in the file
times.txt. On average, a speedup of about 28.25% compared to the previous step (Step1) was recorded.
Justify:
The performance improvement was primarily achieved by a significant reduction in the number of accesses to global
memory. These accesses were replaced by accesses to shared memory, which is significantly faster than global memory.
Threads access global memory only when loading the relevant data into shared memory. Furthermore, the principal
requirement for global memory accesses is now met: neighbouring threads access adjacent locations. This requirement
was not met in the previous step (Step1), since the threads in the computation loop all access the data of the same
particle (the j-th particle). Thanks to the frequent loads from shared memory, the individual SMs did not have to
wait for data, which reduced the idle time in the calculation.
In summary, the primary reasons that affected the performance are (a sketch of the tiling pattern follows the list):
- a decrease in the number of global load transactions
- an increase in the number of shared load transactions
- a decrease in stalls caused by memory dependencies
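A minimal sketch of the tiling pattern described above, reusing the assumed Particles layout and constants from the
Step 1 sketch (block-size handling and names are again assumptions, not the assignment code):

    // Each block stages one tile of particles into shared memory with
    // coalesced loads (thread t loads element tileStart + t); all pairwise
    // interactions against that tile are then served from shared memory.
    // Launch with: kernel<<<grid, block, 4 * block * sizeof(float)>>>(...)
    __global__ void calculate_velocity_shared_sketch(Particles p, Particles p_next,
                                                     int N, float dt)
    {
        extern __shared__ float tile[];            // pos_x | pos_y | pos_z | weight
        float *s_pos_x  = tile;
        float *s_pos_y  = &tile[blockDim.x];
        float *s_pos_z  = &tile[2 * blockDim.x];
        float *s_weight = &tile[3 * blockDim.x];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float pos_x = (i < N) ? p.pos_x[i] : 0.0f;
        float pos_y = (i < N) ? p.pos_y[i] : 0.0f;
        float pos_z = (i < N) ? p.pos_z[i] : 0.0f;
        float tmp_x = 0.0f, tmp_y = 0.0f, tmp_z = 0.0f;

        for (int tileStart = 0; tileStart < N; tileStart += blockDim.x) {
            int j = tileStart + threadIdx.x;
            // the only global loads: one coalesced element per thread
            s_pos_x[threadIdx.x]  = (j < N) ? p.pos_x[j]  : 0.0f;
            s_pos_y[threadIdx.x]  = (j < N) ? p.pos_y[j]  : 0.0f;
            s_pos_z[threadIdx.x]  = (j < N) ? p.pos_z[j]  : 0.0f;
            s_weight[threadIdx.x] = (j < N) ? p.weight[j] : 0.0f;
            __syncthreads();

            for (int k = 0; k < blockDim.x && tileStart + k < N; k++) {
                float dx = s_pos_x[k] - pos_x;
                float dy = s_pos_y[k] - pos_y;
                float dz = s_pos_z[k] - pos_z;
                float r  = sqrtf(dx * dx + dy * dy + dz * dz);
                if (r > COLLISION_DISTANCE) {
                    float f = GRAV_CONST * s_weight[k] / (r * r * r) * dt;
                    tmp_x += dx * f; tmp_y += dy * f; tmp_z += dz * f;
                }
                // (collision branch omitted for brevity; it would also read
                //  velocities staged in shared memory)
            }
            __syncthreads();
        }

        if (i < N) {
            float vel_x = p.vel_x[i] + tmp_x;
            float vel_y = p.vel_y[i] + tmp_y;
            float vel_z = p.vel_z[i] + tmp_z;
            p_next.vel_x[i] = vel_x;  p_next.pos_x[i] = pos_x + vel_x * dt;
            p_next.vel_y[i] = vel_y;  p_next.pos_y[i] = pos_y + vel_y * dt;
            p_next.vel_z[i] = vel_z;  p_next.pos_z[i] = pos_z + vel_z * dt;
        }
    }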
Compare the metrics with the previous step:
Profiling command: ./nbody 30720 0.01f 1 1024 0 4096 128 ../sampledata/30720.h5 stepBOutput.h5
The metrics in this step are significantly affected by the use of shared memory, in comparison to the previous step (Step1).
Shared Load Transactions (shared_load_transactions): 0 (Step1) => 117 964 800 (Step2)
 -> Since shared memory was not used in Step1, the number of shared load transactions was zero. In this step, with
    the use of shared memory, this metric increased to 117 964 800. An interesting observation is that the number
    of shared load transactions is almost equal to the number of global load transactions in the previous step,
    where shared memory was not used:
    -- Global Load Transactions (Step1) = 117 971 520 => Shared Load Transactions (Step2) = 117 964 800
    Thus we can see that in this step the loads from global memory were almost completely replaced by loads from
    shared memory.
Issue Stall Reasons (Data Request) (stall_memory_dependency): 12.76% (Step1) => 0.01% (Step2)
 -> In Step1, data was loaded only from global memory, which takes many clock cycles (around 440 clocks). Under
    these circumstances the SMs may have to wait for data, which causes idle time in the computation, and the
    efficiency of the SMs is then low.
 -> In this step (Step2), loads from global memory were restricted and the threads load primarily from shared
    memory, which takes only about 48 clocks. As a result, the SMs do not have to wait for data, there is only
    minimal idle time, and it is very likely that the efficiency of the SMs increases.
Global Load Transactions (gld_transactions): 117 971 520 (Step1) => 216 960 (Step2)
 -> We drastically decreased the number of global load transactions by using shared memory. In the previous step,
    we accessed global memory very often and repeatedly inside the loop in order to load the required particle
    data. Now, we access global memory only when loading the required data into shared memory, and subsequently we
    work only with this loaded data.
Global Memory Load Efficiency (gld_efficiency): 12.52% (Step1) => 100.00% (Step2)
 -> In Step1, the individual threads within a block all accessed the j-th particle stored in global memory in the
    computation loop. Such accesses were therefore not coalesced and the basic requirement for memory access was
    not satisfied: neighbouring threads did not access adjacent locations, but a common location instead.
 -> In Step2, this shortcoming has been remedied by using shared memory. The individual threads within a block
    access global memory only when loading the relevant particle into shared memory. Since each thread in a block
    loads its own element depending on its thread index, the requirement is met: neighbouring threads access
    adjacent locations, and therefore we achieved a significant increase in this metric.
Requested Global Load Throughput (gld_requested_throughput): 4.6226GB/s (Step1) => 387.20MB/s (Step2)
 -> In the previous step (Step1), the threads accessed global memory very frequently in the inner loop of the
    computation in order to obtain the data needed for each iteration. For this reason the requested global load
    throughput was very high. In comparison, in this step the threads access global memory only when loading the
    relevant data into shared memory. Therefore a significant decrease in the requested global load throughput was
    observed, since the frequency of accesses to global memory is also significantly lower.
Step 5: performance analysis (STEPS=500)
======================
N         CPU time [s]   GPU time [s]   memory throughput [MB/s]   performance [MFLOPS]   speedup [-]   thr_blc
-----------------------------------------------------------------------------------------------------------------
128       0.463157       0.085749       238.0742                    4311.7470              5.4013        32
256       1.82851        0.123418       185.5281                    11559.2863             14.8155       256
512       7.29554        0.217018       251.3829                    26057.0459             33.6172       512
1024      29.1873        0.402592       380.5525                    56050.3934             72.4984       1024
2048      116.772        0.777541       677.9731                    116010.8598            150.1811      1024
4096      467.235        1.535198       1241.9881                   234982.7344            304.3484      1024
8192      1869.88        3.053371       2330.9791                   472557.6708            612.3985      1024
16384     ~7479.52       6.639555       5893.8385                   869260.2224            ~1126.5092    256
32768     ~29918.08      29.237558      8216.5539                   789595.9223            ~1023.2756    512
65536     ~119672.32     117.196181     14193.225                   787911.1722            ~1021.1281    512
131072    ~478689.28     411.074028     24770.7922                  ~788334.1568           ~1164.4843    512
From what number of particles does it pay off to compute on the graphics card?
 -> Assuming that the optimized parallel version should be at least 10x faster to be worthwhile, it is more
    efficient to compute the simulation on the GPU from N=256 upwards, where a roughly 14x speedup compared to the
    simple CPU version was achieved.
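    For example, the table above gives 1.82851 s on the CPU versus 0.123418 s on the GPU at N=256, i.e.
    1.82851 / 0.123418 ≈ 14.8x, whereas at N=128 the ratio is only 0.463157 / 0.085749 ≈ 5.4x, which is below the
    assumed 10x threshold.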
===================================