% Due Date: 6/30/14
\chapter{Physical Dependence Analysis}
\label{chapter:physical}
After tasks have completed the logical dependence
analysis stage of the Legion task pipeline, they
then proceed to the mapping stage of the pipeline
(see Chapter~\ref{chapter:arch}). The process
of mapping entails selecting a processor on which
to execute each task and then determining memories
in which to place physical instances for each
of the logical regions on which the task has requested
privileges. It is the responsibility of the runtime
to compute all of the data movement operations and
preconditions necessary for actually executing a
task in accordance with the requested logical
region privileges and coherence. We refer to this
process as physical dependence analysis as it
involves actually computing the operations necessary
for mapping a Legion application onto real hardware.
To ensure that dependences and data movement
operations are determined correctly, a task $T$
is only allowed to map when all of the other tasks
in the dynamic dependence graph on which $T$
has a dependence have mapped. This invariant
guarantees that when a task $T$ maps, it
will see the correct version of the meta-data
stored in the physical state of the region
tree forest and pick up any necessary dependences
on other tasks and operations that came before
$T$ in program order. An important property of
this invariant is that it permits Legion to
map some tasks while logical dependence
analysis is being performed on other tasks. It
is impossible for task $T$ to record a mapping
dependence on any task that comes after it in
program order. This property allows Legion to
execute these stages of the pipeline in parallel
for different tasks, aiding in the
latency hiding of the dynamic analysis.
In many ways the physical dependence analysis
for the mapping stage is similar to the logical
dependence analysis: both algorithms involve
traversals over the region tree forest to determine
dependences. However, there are two primary differences.
First, tasks can begin physical dependence
analysis once all their mapping dependences
have been satisfied, permitting multiple tasks to
be performing their physical dependence analysis
in parallel. Second, the state maintained in
the region tree forest will be different: the
runtime must track the various physical instances
of logical regions as well as the low-level events
necessary for computing preconditions for low-level
runtime operations.
In this chapter we describe the physical
dependence analysis that implements the
mapping stage of the task pipeline. We begin
in Section~\ref{sec:phytraversal} by giving
a general overview of the traversal algorithm.
In Section~\ref{sec:phystree} we cover
the details of the data structures and
optimizations we apply to make the physical
region tree traversal fast.
Section~\ref{sec:physicalcases} covers
the implementation of several important
features of the Legion programming model
including virtual mappings, reduction
instances, composite instances, and compound
region requirements. Finally, in
Section~\ref{sec:paratraversal} we describe
the necessary algorithms for enabling
parallel traversal of the region
tree forest for non-interfering tasks.
\section{Physical Region Tree Traversal Algorithm}
\label{sec:phytraversal}
The physical region tree traversal algorithm is
in some ways similar to the logical region tree
traversal algorithm described in
Chapter~\ref{chapter:logical}. However, the
goal of the physical traversal algorithm is not
to construct a dynamic dependence graph, but instead
to issue the appropriate operations to the
Legion low-level runtime (see
Section~\ref{subsec:lowlevel}) necessary for the
execution of tasks and other operations.
Since the issuing of low-level operations
also corresponds to the binding of an application
to the hardware, the physical traversal algorithm
makes several mapper queries to determine
how each task should be mapped onto the target
hardware, ensuring that all decisions that
impact performance can be controlled by the
application developer. A partial semantics of
these mapper calls is given in this chapter,
while a complete description of the queries that
can be made to mapper objects can be found in
Chapter~\ref{chapter:mapping}.
In order to map a task, a target processor must
first be selected via a mapping query. After
the processor is selected the runtime requests
that a mapper determine a ranking of memories
for each region requirement by an additional
mapping call. Based on these rankings, the runtime
attempts to map each of the region requirements
to a physical instance in one of the memories
in the target ranking\footnote{As part of this
process, the runtime also filters memories that
are not visible from the target processor to
maintain the correctness of the eventual mapping.}.
In order to map each of the different region
requirements, the runtime either locates or
tries to create a physical instance in one of
the target memories. After a physical instance
is selected, the necessary copy operations are
determined and event dependences computed. To
perform these actions for each region requirement
of a task, four different stages are required. Each
stage requires a traversal of a different
component of the physical state of the region
tree forest. We discuss each of these different
traversals in turn.
\subsection{Premapping Traversal}
\label{sec:premapping}
The first traversal of the region tree is
the {\em premapping traversal}. For each
region requirement in the task, the
premapping traversal walks the physical
region tree from where the parent task had
privileges to where the region requirement is
requesting privileges, similar to the logical
region traversal. However, instead of searching
for interfering tasks, this traversal
{\em opens} all sub-trees along the path and
performs any necessary close operations
recorded by the logical traversal in
Section~\ref{subsec:logicalstate} that have
yet to be performed. The intuition is that
the logical dependence analysis algorithm ensures
that tasks map in the proper order by
recording dependences; therefore, to guarantee
that the physical region tree
remains in the proper state, only the open
and close operations need to be performed
during this first stage. If any close
operations along the path have already been
performed by another task in the same epoch
then they can be skipped.
Unlike close operations in the logical region
tree traversal that only mutate the state
of the meta-data in the region tree forest,
close operations performed on the physical
state are translated to low-level runtime
operations. To perform a close operation,
a physical instance of the logical region
at the root of the sub-tree to be closed
must first be either selected or created.
(Note this physical instance must also
have sufficient space for all the fields
that are being closed.) Since the decisions
about placement and layout of this physical
instance can have performance ramifications,
the {\tt rank\_copy\_targets} mapper call
is invoked to pick both the location and
layout of the physical instance to be used
as the target of the close operation.
The mapper query is told about all the existing
physical instances that meet the necessary
qualifications in terms of space and needed
fields, and the mapper can decide whether to
re-use an existing instance or request the
creation of a new instance in any number of
memories. The mapper also has the ability to
choose if multiple instances should be created
in different memories for the close, which
is a useful feature when there may be many
tasks running in different parts of the machine
that will require a copy of the data in the closed
logical region. Like many mapper calls, it is
possible for the {\tt rank\_copy\_targets} mapper
call to {\em fail-to-map}. If there exists no
physical instance to re-use, and all the target
memories selected by the mapper are full, then
the mapping call will fail. Under such
circumstances, the premapping traversal will
also fail. When this occurs the mapper is notified
by the {\tt notify\_failed\_mapping} call
and the task is placed back on the ready
queue, where the mapper can choose when to
next try mapping the task.
Once one or more target instances for the close
operation have been successfully identified, the
close operation can proceed. To close a sub-tree,
the target physical instances are first brought
up to date at the root of the tree by
issuing copies from the existing physical
instances containing valid data. (We
discuss the implementation of these copies
later in this section.) If the targets are
already valid instances, then nothing is done.
The close operation then proceeds down the
sub-tree being closed and issues copies from
any instances containing dirty data to
the target instances. Since the most
recent dirty data is always at the lowest
levels of the open sub-tree, by traversing
from top-to-bottom, we guarantee that dirty
versions of data are applied in the correct
order to the target instances. Once the close
operation is complete, the target instances
are all registered as valid versions of data
at the node at the root of the close operation.
If any dirty data was found, all previous
valid physical instances at the root that
were not targets are invalidated. After the
close operation is complete, we record that
it has been performed so that any other tasks
in the same epoch will not need to repeat it.
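The top-to-bottom ordering of a close operation can be illustrated
with a small C++ sketch. The types and the {\tt issue\_copy} hook
below are illustrative placeholders rather than the runtime's actual
interfaces; the sketch only shows why visiting dirty instances from
the root of the closed sub-tree downward applies dirty data to the
target instance in the correct order (in the real runtime each copy
would also carry event preconditions).
\begin{verbatim}
#include <bitset>
#include <cstdio>
#include <utility>
#include <vector>

using FieldMask = std::bitset<64>;

struct Instance { const char *name; };

// One node of an open sub-tree: instances holding dirty data for
// some fields, plus the node's open children.
struct Node {
  std::vector<std::pair<Instance*, FieldMask>> dirty_instances;
  std::vector<Node*> open_children;
};

// Hypothetical hook that would ask the low-level runtime for a copy.
void issue_copy(Instance *src, Instance *dst, const FieldMask &fields) {
  std::printf("copy %s -> %s (fields %s)\n",
              src->name, dst->name, fields.to_string().c_str());
}

// Close a sub-tree into 'target' by walking top-to-bottom so that
// deeper (more recent) dirty data overwrites shallower dirty data.
void close_subtree(Node *node, Instance *target, const FieldMask &closing) {
  for (auto &entry : node->dirty_instances) {
    FieldMask overlap = entry.second & closing;
    if (overlap.any() && entry.first != target)
      issue_copy(entry.first, target, overlap);
  }
  for (Node *child : node->open_children)
    close_subtree(child, target, closing);
}
\end{verbatim}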
Figure~\ref{fig:exclose} shows an example close
operation from the circuit simulation that occurs
between the {\tt update\_voltages} and
{\tt calc\_new\_currents} tasks. The first of the
{\tt calc\_new\_currents} tasks to premap will
be the one that triggers the close operation. In this
particular close operation, all of the data in the
{\tt p\_shr\_nodes} partition must be copied back
to at least one instance of the {\tt all\_shared}
logical region. Performing this close operation
is a necessary precondition for opening the
{\tt p\_ghost\_nodes} partition that is necessary
to premap all of the {\tt calc\_new\_currents}
ghost logical regions. In this particular example,
the mapper decided to create two target physical
instances of the {\tt all\_shared} logical region
(shown in red and green), so copies are issued from
all physical instances in logical regions of the
{\tt p\_shr\_nodes} partition. After the copy
operations are complete, both instances of the
{\tt all\_shared} logical region will contain
identically valid versions of the data.
\begin{figure}[t]
\centering
\includegraphics[scale=0.9]{figs/CloseOperation.pdf}
\caption{Example Close Operation from the Circuit Simulation\label{fig:exclose}}
\end{figure}
One important detail regarding the premapping
traversal is that it must always be performed
on the same node as the parent task before any
physical region tree meta-data is moved to
remote nodes for mapping (see
Chapter~\ref{chapter:distributed} for more
details). The reason for this is that we
need to be sure that close operations are
performed exactly once, as the updates that
they perform to the region tree meta-data
must be serializable. Therefore, since all
sub-tasks start on the same node as the
parent task, by requiring them all to complete
the premapping traversal on the parent task's node,
we can be sure that close operations are
performed only once. Requiring premapping
traversals to be done on the parent node
also has the added benefit of reducing the
amount of meta-data that must be sent to
other nodes for remote mapping as only
the sub-trees of logical regions requested
by the region requirement are needed for
the remaining stages.
\subsection{Mapping Traversal}
\label{subsec:maptraversal}
The second traversal that is performed is
the {\em mapping traversal}. The purpose
of this traversal is two-fold: first, in
the case of projection region requirements
it traverses from the region tree node where
privileges were requested to the region tree
node where mapping will take place, and
second, it actually selects which physical
instance to use (or re-use) for mapping the
region requirement without making it valid.
Unlike the premapping traversal
that was required to be performed on the
origin node of the task, the mapping
traversal can be performed remotely. To
map remotely, the necessary meta-data for
the physical region tree must be moved
to the target node. We discuss movement
of meta-data for remote mapping further
in Chapter~\ref{chapter:distributed}. We
now cover the two phases of the mapping
traversal in more detail.
The first phase of the mapping traversal
computes the path from where the original
region requirement requested privileges
to where the individual task's region
requirement is requesting privileges. For
individual tasks and other operations
that use normal region requirements, this
path consists of a single node and requires
no work. However, for index space tasks
that can use projection region requirements
(see Section~\ref{subsec:indextasks}), this
path can be arbitrarily long.
Figure~\ref{fig:mapaths} shows the difference
between the premapping and mapping paths for
a projection region requirement of the
{\tt update\_voltages} task from the circuit
simulation. The premapping traversal to the
{\tt p\_shr\_nodes} partition is done exactly
once for the index space task. However, the
mapping traversal is done individually on
a point-wise basis from the {\tt p\_shr\_nodes}
partition to the logical regions on which
each point task is requesting privileges.
\begin{figure}[t]
\centering
\includegraphics[scale=0.9]{figs/PremappingTraversal.pdf}
\caption{Premapping and Mapping Paths\label{fig:mapaths}}
\end{figure}
The one important invariant that is different in
the mapping traversal from the premapping
traversal is that no close operations will
need to be performed as part of the mapping
traversal. By using projection region requirements
to scope an upper bound on the privileges
requested by all tasks in an index space
task launch, we guarantee that all the necessary
close operations are computed and performed
as part of the index space task's premapping,
and therefore no close operations need to
be performed for point tasks within the
index space task. As part of the mapping
traversal, the necessary sub-trees within
the region tree physical state are opened
to reflect that there are potential users
in the sub-trees.
After arriving at the destination node,
the traversal then attempts to either find or
create a physical instance of the target
logical region to hold the data necessary
for satisfying the region requirement.
Determining the physical instance to select
requires input from the mapper since both
the placement of the physical instance as
well as its layout can impact the performance
of the application. We therefore ask the
mapper to assign a ranking of memories and
set of properties (e.g. layout information)
for all region requirements prior to performing
any of the mapping traversal by invoking
the {\tt map\_task} mapper call. The mapper
decorates the region requirements for the
task with its chosen ranking of memories and
required set of constraints on the layout
of the physical instance. Based on these
requirements, the runtime then sequentially
walks through the list of memories, and for
each memory attempts to either find or
create a physical instance that meets the
constraints. When looking for a physical
instance that meets the constraints, the
runtime can select any instance from the
current node, or from any of the nodes above
it. In the process of selecting instances
from further up in the tree, we must be careful
to treat fields of those higher instances as invalid
when the same fields of instances lower
in the tree contain dirty data. It is safe
to re-use the instances, but it will require
additional data movement to bring the fields
up-to-date. If no instance can be found or
created for any of the region requirements
then the mapping stage fails, and the mapper
is notified with the {\tt notify\_failed\_mapping}
mapper call.
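A minimal sketch of this memory walk is shown below, assuming
simplified placeholder types ({\tt Memory}, {\tt Layout},
{\tt Instance}) and a stubbed {\tt find\_existing} search; it is not
the runtime's actual allocation interface, but it captures the
find-or-create loop over the mapper's ranking of memories.
\begin{verbatim}
#include <cstddef>
#include <optional>
#include <vector>

struct Memory   { std::size_t capacity; std::size_t used; };
struct Layout   { /* field ordering, alignment, ... */ };
struct Instance { Memory *memory; Layout layout; };

// Placeholder: the real search consults the region tree for an
// existing instance in 'mem' that satisfies the layout constraints.
std::optional<Instance*> find_existing(Memory*, const Layout&) {
  return std::nullopt;
}

// Walk the mapper's ranking of memories; in each memory first try to
// re-use an existing instance, then try to allocate a new one.  If
// every memory fails, the region requirement (and the task) fails to
// map and the mapper would be notified via notify_failed_mapping.
std::optional<Instance*> select_instance(
    const std::vector<Memory*> &ranking,
    const Layout &layout,
    std::size_t instance_size) {
  for (Memory *mem : ranking) {
    if (auto existing = find_existing(mem, layout))
      return existing;
    if (mem->capacity - mem->used >= instance_size) {
      mem->used += instance_size;
      return new Instance{mem, layout};
    }
  }
  return std::nullopt;  // mapping fails for this region requirement
}
\end{verbatim}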
It is important to note that after the mapping
traversal is complete, we do not actually
issue any data movement operations for
updating physical instances. Instead this
is done in the third traversal when we register
the physical instances (see
Section~\ref{subsec:regiontraversal}). The reason
for this discrepancy is that in some cases, we
may only discover that a mapping has failed
after we have completed the mapping traversal
for several other region requirements. In these
circumstances, we need to free any physical instances
that we have created as part of the mapping
traversal in order to avoid resource deadlock.
Rolling back these allocations would be complicated
if we had already issued copy operations to bring
the instances up-to-date. Therefore, we defer
issuing any of these operations until we know for
sure that the mapping will be successful for
all region requirements that a task requested.
In some cases, the mapper for a task may determine
that it does not require the creation of a physical
instance for a region requirement, and instead
only needs to pass privileges for the task to
further sub-tasks. Under these circumstances we
allow mappers to request {\em virtual mappings}
of a region requirement, which will not create
a physical instance. Virtual mappings trivially
succeed in the mapping traversal phase. We discuss
the implications of virtual mappings in more
detail in Section~\ref{subsec:virtualmap}.
\subsection{Region Traversal}
\label{subsec:regiontraversal}
Once we have completed the mapping traversal
for all region requirements for a task, we know
that the mapping stage of the pipeline will
be successful. At this point it is safe to
issue the necessary copy operations for ensuring
that the physical instance selected for each
region requirement has the correct version of
the data for the requested fields of the logical region.
To issue the necessary copies required for making
the physical instances valid copies of the data,
we first determine the set of fields that are invalid.
It is possible that the instance that we have selected
is already valid for all the fields at the target
logical region, in which case no copies need to be
issued. It is also possible that one or more fields
are invalid, and therefore must be updated.
To perform the updates we first compute the
set of {\em valid instances} for each field. The
valid instances are the instances that contain
valid data for the specific fields that
need to be updated from the perspective of the
target logical region. When computing the set of
valid instances, it is legal in some cases to
use physical instances from logical regions
higher in the region tree. Determining when
instances from logical regions higher in the
region tree can be used is non-trivial due to
the existence of buffered dirty data within
the physical instances. To compute the valid
instances from the perspective of a logical
region $R$, we start at $R$ and traverse
back up the tree. As we traverse up, we maintain
a field mask that describes the {\em search fields} for
which we are still looking for valid instances. Before
traversing up to the next node, we remove any fields
that contain dirty data at the current node from the
search field mask, thereby preventing us from considering
instances higher in the region tree as valid. At each
node we look for physical instances that contain
valid data for the target field. We record all
physical instances that have valid data for any
of the fields for which we are searching. We
maintain a field mask for each of these physical
instances; the field mask records the subset of fields for
which they contain valid data. When the search field
mask is empty, or we reach the region at the top
of the parent task's privilege context, we can
return the set of valid instances.
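This search can be made concrete with a short C++ sketch over
simplified placeholder types, using {\tt std::bitset} as the field
mask; the actual runtime types differ, but the removal of dirty
fields from the search mask before ascending to the parent is the
essential step.
\begin{verbatim}
#include <bitset>
#include <map>
#include <utility>
#include <vector>

using FieldMask = std::bitset<64>;

struct Instance;

// The pieces of per-node physical state that matter here: which
// fields hold dirty data at this node, and which instances are
// valid for which fields.
struct RegionNode {
  RegionNode *parent = nullptr;
  FieldMask dirty;
  std::vector<std::pair<Instance*, FieldMask>> valid;
};

// Walk from 'start' toward the root, gathering instances that are
// valid for any of the 'search' fields.  Fields that are dirty at a
// node are removed from the search mask before moving to the parent,
// so stale copies higher in the tree are never returned.
std::map<Instance*, FieldMask>
find_valid_instances(RegionNode *start, FieldMask search) {
  std::map<Instance*, FieldMask> result;
  for (RegionNode *node = start; node != nullptr && search.any();
       node = node->parent) {
    for (auto &entry : node->valid) {
      FieldMask overlap = entry.second & search;
      if (overlap.any())
        result[entry.first] |= overlap;
    }
    search &= ~node->dirty;  // ancestors are stale for dirty fields
  }
  return result;
}
\end{verbatim}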
Figure~\ref{fig:validtrav} depicts an example search
for valid instances in a region tree for two fields
$A$ and $B$. Starting from the query region,
the traversal proceeds up the tree. Initially both
the $A$ and $B$ fields are in the search field mask.
Instance $I_4$ contains a valid copy of $A$ and is
added to the result set. We then progress to the
parent logical region where we discover instance
$I_3$, which contains valid data for both $A$
and $B$ and is added to our result set. Before
traversing the grandparent region, we first must
remove field $A$ from the search field mask because it
is a dirty field in the parent logical region. This
indicates that any instances in ancestor regions will
contain stale data for field $A$. Therefore, when we
arrive at the grandparent region, we can safely add
instance $I_2$ to the result set for field $B$.
However, we cannot add either $I_1$ or $I_2$ for
field $A$ because $A$ is no longer in our search
field mask. Ultimately, our result set indicates that
instances $I_3$ and $I_4$ contain valid data for
field $A$ and instances $I_2$ and $I_3$ contain
valid data for field $B$.
\begin{figure}[t]
\centering
\includegraphics[scale=0.9]{figs/ValidViews.pdf}
\caption{Example Valid Instance Views Computation\label{fig:validtrav}}
\end{figure}
Once we have the valid set of instances for
all the fields we need to update, we can issue
the copy operations to update our target
instance. In many cases, there may be multiple
physical instances that can serve as source
instances for the copies. Since the choice
of sources may impact performance, we consult
the mapper when necessary about selecting the
sources for copy operations. If there are multiple
valid instances, we query the mapper
to rank the memories that should serve as
sources for the copies. We invoke the mapper
by calling the {\tt rank\_copy\_sources} mapper
call and provide the mapper with a list of
the memories that currently contain valid
physical instances and the memory of the
destination physical instance. The mapper
returns an ordered ranking of the memories
or a subset of the memories, in which case
the other memories are appended to the ranking
in a random order to maintain correctness\footnote{
Even if the mapper does not rank all memories, the
runtime still needs to guarantee that copies get
issued to update all fields with valid data, hence
the necessary appending to the list.}. Using this ranking,
the runtime then issues the necessary update
copies from physical instances based on the
ranking of memories. Only one copy needs to
be issued for each field. Where possible,
copies are issued jointly as gathers from
multiple physical instances over multiple
fields to afford the low-level runtime
maximum visibility in optimizing copy
routines and fusing together data movement
to ensure bulk data transfers.
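The following sketch shows one way the per-field source selection
could look, again over placeholder types; it assigns every needed
field to an instance in the highest-ranked memory that has valid data
for it, and groups the assignments per source instance so that each
source contributes a single gathered copy.
\begin{verbatim}
#include <bitset>
#include <map>
#include <utility>
#include <vector>

using FieldMask = std::bitset<64>;

struct Memory;
struct Instance { Memory *memory; };

// Given the mapper's ranking of source memories and the set of valid
// instances (with the fields each one covers), pick one source per
// field and group the choices per instance.
std::map<Instance*, FieldMask> choose_copy_sources(
    const std::vector<Memory*> &ranking,
    const std::vector<std::pair<Instance*, FieldMask>> &valid,
    FieldMask needed) {
  std::map<Instance*, FieldMask> copies;
  for (Memory *mem : ranking) {
    if (!needed.any())
      break;
    for (const auto &entry : valid) {
      if (entry.first->memory != mem)
        continue;
      FieldMask overlap = entry.second & needed;
      if (overlap.any()) {
        copies[entry.first] |= overlap;  // one copy per field
        needed &= ~overlap;              // field now has a source
      }
    }
  }
  return copies;  // 'needed' is empty if the ranking covered all fields
}
\end{verbatim}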
Now that we have an up-to-date physical instance
at the target logical region, we lastly
need to flush any dirty data from below in
the tree up to the current logical region.
We examine all of the open children to
determine if they have any dirty data. If
they do we issue a close operation with
our selected physical instance as the target.
This ensures that any dirty data below in
the tree is properly added to our physical
instance. The privileges requested by
the region requirement determine if the
sub-trees can be left open: read-only privileges
permit sub-trees to remain open while read-write
and reduce privileges do not since these privileges
invalidate the instances in the
sub-trees. After any close operations
are performed we can register our new
physical instance as a valid physical
instance at the current node in the
region tree. If any dirty data was flushed
up from sub-trees, then we update the
set of dirty fields at the current node
and invalidate any other physical instances
at the current node (since they are no
longer valid). We then add our physical
instance to the list of valid physical
instances for the fields requested by
our region requirement.
\subsection{Physical Instance Traversal}
\label{subsec:instancetraversal}
After we update all the field data for
each physical instance that is going to be
used for a task, we now need to
determine the set of dependences that
must be satisfied before we can safely
use each physical instance. These dependences take
the form of low-level runtime events
corresponding to other tasks and operations
that are already using the physical
instances. Once we have computed the set
of event preconditions, we can merge
them together into a single low-level
runtime event and launch the corresponding
low-level runtime task with the proper
event precondition.
To determine the event preconditions for
a new user, we perform a traversal over
the {\em instance view tree} for the target
physical instance. An instance view tree is a
data structure that mirrors a region tree and
tracks users of a physical instance from the
perspective of different logical regions. We
cover the details of instance view trees in
further detail in Section~\ref{subsec:instviews}.
Figure~\ref{fig:instrav} shows the instance views
in the instance view tree that need to be checked
when performing a physical instance traversal for a
new user of a physical instance. Views highlighted
in red must be visited to check for dependences, while
those in blue can be skipped. Specifically, the
user must visit the instance view at which it will be
registered as well as all ancestor views and all
sub-tree views. While it is possible for there to
exist interfering users from the blue instance views,
they will be detected in the ancestor views
because all users are registered in all their ancestor
instance views as well (see Section~\ref{subsec:instviews}
for more details). At each instance view that is
traversed, the user being registered examines all
the currently registered users for interference
based on the same criteria as was presented in
Section~\ref{sec:noninterference}. To aid in
discovering disjointness on logical regions
of users registered in ancestor instance views,
we also annotate users with which sub-tree they
came from, allowing region disjointness to be
determined without traversing any sub-trees.
The output of the instance view traversal is a
set of low-level runtime events that describe
all the interfering users in the instance view
tree. The runtime can then use this set of events
to construct a single low-level runtime event
that serves as the precondition for an operation
using the physical instance. Once the
analysis is complete, the user is itself registered
in the tree with its termination event. The user
is registered in all instance views from the view
in which it is requesting privileges up through all ancestor
instance views to the root of the instance view tree.
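A simplified C++ sketch of this traversal is given below. The
{\tt interferes} test is a placeholder that only checks fields and
privileges (the real test also accounts for coherence modes and
region disjointness), and {\tt Event} is a stand-in for a low-level
runtime event; none of the names correspond to actual runtime
declarations.
\begin{verbatim}
#include <bitset>
#include <vector>

using FieldMask = std::bitset<64>;
using Event = unsigned;       // stand-in for a low-level runtime event

enum Privilege { READ_ONLY, READ_WRITE, REDUCE };

struct User {
  Privilege privilege;
  FieldMask fields;
  Event     term_event;       // triggers when the user finishes
};

struct InstanceView {
  InstanceView *parent = nullptr;
  std::vector<InstanceView*> children;
  std::vector<User> users;
};

// Placeholder interference test on privileges and fields only.
bool interferes(const User &a, const User &b) {
  if (!(a.fields & b.fields).any()) return false;
  return !(a.privilege == READ_ONLY && b.privilege == READ_ONLY);
}

void gather_subtree(InstanceView *view, const User &user,
                    std::vector<Event> &preconditions) {
  for (const User &other : view->users)
    if (interferes(user, other))
      preconditions.push_back(other.term_event);
  for (InstanceView *child : view->children)
    gather_subtree(child, user, preconditions);
}

// Collect preconditions from this view, its sub-tree, and its
// ancestors, then register the user here and in every ancestor.
std::vector<Event> register_user(InstanceView *view, const User &user) {
  std::vector<Event> preconditions;
  gather_subtree(view, user, preconditions);
  for (InstanceView *up = view->parent; up != nullptr; up = up->parent)
    for (const User &other : up->users)
      if (interferes(user, other))
        preconditions.push_back(other.term_event);
  for (InstanceView *reg = view; reg != nullptr; reg = reg->parent)
    reg->users.push_back(user);
  return preconditions;
}
\end{verbatim}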
\begin{figure}[t]
\centering
\includegraphics[scale=0.9]{figs/InstanceViewTraversal.pdf}
\caption{Instance View Tree Traversal\label{fig:instrav}}
\end{figure}
After we have completed the physical instance
traversal for each of the region requirements
of a task, we have the set of all low-level
runtime events that must trigger before the
task can run. Using the low-level runtime
interface, we merge all these events together
and launch the task with the computed event
as a precondition. The event returned by the task launch
marks when the task
will be done executing, and we use it for registering
the task as a user in the instance view tree.
One important detail is that the same instance
view tree traversal algorithm described here is
also used in Section~\ref{subsec:regiontraversal}
to compute the preconditions for issuing copy
operations to and from physical instances. Both
the source and destination region trees are
analyzed. Copies take the form of a special
kind of user with exclusive coherence and
either read-only, read-write, or reduce
privileges depending on the kind of copy being
performed. The low-level runtime also takes
an event precondition on copies and returns
an event for describing when the copy will
be finished. Using this low-level event,
copies also register themselves as
users in the instance view trees to ensure
correct dependence analysis when computing
the set of precondition events for any
future user.
\section{Physical Region Tree Data Structures}
\label{sec:phystree}
Much like the state required for the logical
dependence analysis described in
Chapter~\ref{chapter:logical}, our physical
region tree traversal for performing the mapping
analysis also requires the storage of meta-data
on nodes in the region forest. While some of
the meta-data stored on individual nodes is the
same as the logical meta-data, some aspects
are different. We describe the specific
state required for each node in
Section~\ref{subsec:physstate}.
In addition to storing state on each node, the
physical region tree traversal will also require
two new kinds of data structures:
{\em instance managers} for tracking
physical instances and {\em instance views}
for tracking the operations that
are using the physical instances.
We provide descriptions of these data
structures in Sections~\ref{subsec:instmanagers}
and~\ref{subsec:instviews}, respectively.
Similar to the logical dependence analysis data,
physical region tree traversal meta-data is
decorated on nodes of the region tree forest.
To differentiate the state for different
task executions, we again use the notion of
contexts, initially introduced in
Section~\ref{subsec:logicalctx}. The same context
ID that is allocated for non-leaf parent tasks
for performing dependence analysis is used for
performing the physical region tree traversal
for sub-tasks launched within a parent task
(with the exception of virtual mappings, discussed in
Section~\ref{subsec:virtualmap}). Each region tree
node maintains a separate array of physical meta-data
objects that can be directly indexed by the
context ID, allowing traversals from different
contexts to occur independently.
\subsection{Region Tree State}
\label{subsec:physstate}
On each node of the region tree, several different
meta-data structures are maintained. First, similar to
the logical state data, each physical state for a
node tracks which of the sub-trees are {\em open}.
There is a slight difference in the definition of open
for the physical state of the region trees: a sub-tree
is said to be open for a field $f$ if it contains at
least one valid physical instance for $f$ at some
node in the sub-tree. Note that this is different
from the logical state where a sub-tree was considered
open if there were operations registered in the sub-tree.
Unlike the logical region tree state, where both
the fields and the privilege modes for open sub-trees
were tracked, the physical region tree traversal
only needs to track which fields of sub-trees are
open. The reason for this is that the logical
region traversal memoized the results of the privilege
analysis when the close operations were recorded
for each operation. By applying the close operations
as part of the premapping traversal discussed in
Section~\ref{sec:premapping}, sub-trees will be
implicitly open with the proper privileges (e.g.
multiple sub-trees will only be open if multiple
read-only or reduction tasks are active in those
sub-trees). Open information for individual fields
is still required to know whether a sub-tree is
actually open or whether it was closed by another
task in the same epoch during its premapping
traversal.
The second piece of meta-data state stored on each
region tree node is a field mask that tracks dirty
information. The dirty field mask records which
fields at the current region tree node contain
dirty information (i.e., fields that have been written to).
When a region tree is going to be closed, dirty
information is necessary for computing which
data movement operations are required to propagate
information back to higher levels in the region tree.
The final component of the physical state at
each region tree node is a list of {\em valid
instances}. This list contains
instance view objects (see
Section~\ref{subsec:instviews}) that represent
physical instances that contain a valid version
of data from the local region tree node. Each
valid instance is accompanied by a field mask
that describes the specific fields for which
the physical instance contains a valid copy
of the data. It is often possible that a
physical instance only contains valid data for
a subset of the fields for which it has
allocated space.
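Putting these three pieces together, the per-node, per-context
physical state can be summarized by a structure along the following
lines. The field and type names are illustrative, not the runtime's
actual declarations.
\begin{verbatim}
#include <bitset>
#include <map>
#include <utility>
#include <vector>

using FieldMask = std::bitset<64>;

struct InstanceView;       // described in the next subsection
using ChildID = unsigned;  // identifies a child partition or sub-region

// Per-context physical meta-data kept at each region tree node.
// Each node holds an array of these, indexed directly by context ID.
struct PhysicalState {
  // Fields for which each child sub-tree currently holds at least
  // one valid physical instance (the "open" information).
  std::map<ChildID, FieldMask> open_children;
  // Fields that have been written at this node and not yet flushed
  // back to ancestor regions.
  FieldMask dirty_fields;
  // Instance views holding valid data here, with the fields for
  // which each view's instance is actually valid.
  std::vector<std::pair<InstanceView*, FieldMask>> valid_instances;
};
\end{verbatim}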
\subsection{Physical Instance Managers}
\label{subsec:instmanagers}
When physical instances are allocated, the
Legion runtime creates a {\em physical instance
manager} to keep track of the properties of the
physical instance. These properties include the
memory in which the instance is allocated, the location
in that memory, the layout of the fields, and the
linearization of the index space.
When copies are issued either to or
from a physical instance, the physical instance
manager is responsible for generating the necessary
instructions for directing the low-level runtime
about how to perform the copy. For instances with
hundreds or thousands of fields, this information
can be expensive to compute. We therefore cache
the meta-data required for issuing copies so that
it can be re-used when similar copies are requested.
Since instance managers are often large objects that
contain no context-specific information, we deduplicate
them across physical contexts and ensure that at
most one physical manager exists for each physical
instance on a single node. We describe how physical
managers are lazily migrated between nodes in
Chapter~\ref{chapter:distributed}. The per-node
uniqueness property guaranteed by physical instance
managers also serves a useful purpose in our
garbage collection mechanism for physical instances
that we describe in Chapter~\ref{chapter:garbage}.
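A sketch of the caching idea, with a contrived placeholder layout, is
shown below; the point is only that the per-field copy description is
computed once per distinct field set and re-used for later copies
over the same fields.
\begin{verbatim}
#include <bitset>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using FieldMask = std::bitset<64>;

struct CopyField { unsigned field_id; std::size_t offset, size; };

// Caches the per-field copy descriptions for a given set of fields
// so repeated copies do not recompute offsets from the layout.
class InstanceManager {
 public:
  const std::vector<CopyField>& copy_fields(const FieldMask &fields) {
    auto it = cache_.find(fields.to_ullong());
    if (it != cache_.end())
      return it->second;                     // re-use cached description
    std::vector<CopyField> result;
    for (unsigned f = 0; f < fields.size(); f++)
      if (fields.test(f))
        result.push_back(compute_field(f));  // consult the layout once
    return cache_.emplace(fields.to_ullong(), std::move(result))
        .first->second;
  }
 private:
  CopyField compute_field(unsigned f) {
    // Placeholder layout: contiguous 8-byte fields.
    return CopyField{f, f * 8, 8};
  }
  std::map<unsigned long long, std::vector<CopyField>> cache_;
};
\end{verbatim}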
\subsection{Physical Instance Views}
\label{subsec:instviews}
While instance managers are used to describe the
properties of a physical instance, we use a
separate collection of objects to track the
users of physical instances. An {\em instance view}
is an object that records all operations (e.g.
tasks, copies, inline mappings) that are
using a specific physical instance from the
perspective of a specific logical region. A
physical instance initially created for a logical
region $R$ can serve as a valid physical instance
for any logical region in the sub-tree of $R$.
In some cases there might be multiple tasks using
the same physical instance from the perspective of
two disjoint logical sub-regions. When
these tasks would interfere on everything except
region disjointness, it is important to
detect the disjointness to avoid false dependences.
By providing different instance view objects for
the same physical instance, we can maintain precise
information about how different users are using the
same physical instance and accurately detect when
dependences need to exist between users.
An instance view tracks the set of users of a
specific physical instance from the perspective of
a single node in the region tree. The collection
of instance view objects for a specific physical
instance are therefore organized as a tree that
mirrors the shape of the region tree from which
the physical instance originates. For physical
instances that contain valid data in multiple
contexts there is a separate set of instance
views for each physical context within the region
tree to ensure that users from different contexts
do not need to perform non-interference tests
against users from other contexts\footnote{
This is sound because of the hierarchical privilege
property: if two tasks are non-interfering then so
are any children they generate.}. All users are
registered with instance views from the region
in which they requested privileges to the
root of the instance view tree. (Note
the root of the instance view tree is
always the logical region that initially
created the physical instance.) Registering
users in this way will enable the simple
algorithm for finding interfering users
described in Section~\ref{subsec:instancetraversal}.
When a user is registered we record the privileges,
coherence, and a field mask for describing which
data is being accessed and how it is being accessed.
Each user also records a low-level runtime
event that will be triggered upon completion,
allowing later tasks to know when the user
is done using the physical instance.
As an important optimization for tracking the
users from the perspective of an instance view,
we also make use of {\em epoch lists}, similar
to the ones introduced in Section~\ref{subsec:epochlists}.
Epoch lists provide a mechanism for reducing the
number of users that must be tracked by each
instance view object. Identical to the epoch
lists of logical state objects, each instance view
object maintains two epoch lists: a current epoch
list and a previous epoch list. Whenever an added
user dominates all of the previous users for a
specific set of fields, we can remove the
users of those fields in the previous epoch list
and filter the users from the current epoch list
into the previous epoch list. If the user does
not dominate the users in the current epoch list,
then it must also check for dependences in the
previous epoch list and then add itself to the
current epoch list. The same observability condition
on domination applies: a user performing interference
tests must observe at least one other user of
a field before being able to safely conclude that
it dominates that field.
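The epoch list bookkeeping can be sketched as follows. The sketch
deliberately simplifies the domination test (it treats every observed
field as dominated and omits the interference checks and event
dependences the real runtime records); it is meant only to show how
users are demoted from the current to the previous epoch list and how
previous-epoch users of dominated fields are dropped.
\begin{verbatim}
#include <bitset>
#include <vector>

using FieldMask = std::bitset<64>;
using Event = unsigned;

struct EpochUser { FieldMask fields; Event term_event; };

// Current and previous epoch lists for one instance view, restricted
// to the domination bookkeeping.
struct EpochLists {
  std::vector<EpochUser> current, previous;

  void add_user(const EpochUser &user) {
    // A field can only be dominated if at least one current user of
    // that field was observed (the observability condition).
    FieldMask observed;
    for (const EpochUser &u : current)
      observed |= u.fields & user.fields;
    FieldMask dominated = observed;  // simplification: assume the new
                                     // user dominates every observed field
    if (dominated.any()) {
      // Drop previous-epoch users of dominated fields and demote
      // current-epoch users of those fields to the previous epoch.
      filter(previous, dominated, /*demote_to=*/nullptr);
      filter(current, dominated, &previous);
    }
    current.push_back(user);
  }

  static void filter(std::vector<EpochUser> &list, const FieldMask &fields,
                     std::vector<EpochUser> *demote_to) {
    std::vector<EpochUser> kept;
    for (EpochUser &u : list) {
      FieldMask overlap = u.fields & fields;
      if (overlap.any() && demote_to)
        demote_to->push_back(EpochUser{overlap, u.term_event});
      u.fields &= ~fields;
      if (u.fields.any())
        kept.push_back(u);
    }
    list.swap(kept);
  }
};
\end{verbatim}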
Due to the large number of users that can
accumulate in the current and previous epoch
lists, the runtime supports an important optimization
for reducing the number of users. By default,
lists are only modified by the standard instance view
traversal protocol described in
Section~\ref{subsec:instancetraversal}. However, the application
can also set an upper bound for the number of entries
in the epoch lists. Whenever the upper bound is reached,
the runtime performs three operations to compact the
epoch lists: user deduplication, privilege merging,
and field merging. The first compaction pass looks
for users from the same operation that have been
fragmented on fields, due to epochs changing for different
fields at different times, resulting in multiple
versions of a user ending up in the previous epoch list.
The second compaction pass looks for users from the
same epoch on the same fields, and merges these users
together into a single user entry. Merging users also
requires generating a new event that represents all
the previous user events which is then stored in the
entry for the user. Both of the first two compaction
passes still maintain the precision of the
analysis. The third operation reduces the precision
of the analysis and therefore may be optionally disabled
with the understanding that it will cause the upper
bound on the epoch list sizes to be a soft upper
bound that may be exceeded. In the third
compaction pass, users of different fields with the
same privileges are merged, again resulting in a single
merged event to represent multiple users. The resulting
users must contain a union of all the fields that
were merged, potentially resulting in false dependences
being created between operations, but with the benefit
of a hard upper bound on epoch list sizes.
\section{Special Cases}
\label{sec:physicalcases}
Unlike the dependence analysis stage of the pipeline
that maintains a relatively homogeneous implementation
independent of the privileges and coherence of
region requirements, many of the special features of
the Legion runtime must be handled differently in the
mapping stage of the operation pipeline.
The reason for this is simple: the mapping stage
is where the logical description of Legion programs
is targeted at a specific architecture. It is
therefore this stage that must handle the cross-product
of interacting Legion features and how they are mapped
onto the hardware. We discuss the implementation of
several of the more important features in this section.
\subsection{Supporting Reduction Instances}
\label{subsec:physreductions}
One of the important features of the Legion programming
model is that reduction privileges permit tasks to
be non-interfering. In order to support this feature,
Legion takes advantage of specialized {\em reduction
instances} provided by the low-level runtime \cite{Realm14}.
Reduction instances are a special kind of physical
instance that only support reduction operations;
arbitrary reads and writes are not permitted. The low-level
runtime provides two kinds of reduction instances:
reduction-list instances and reduction-fold instances.
Reduction-list instances buffer reductions as a
list, while reduction-fold instances maintain
separate entries for each row in an index space, but
at the significantly smaller cost of only needing
to store the {\em right-hand-side} (RHS) value
of the reduction operator, which is commonly smaller
than the target field of the reduction. If multiple
reductions are applied to the same row, a
reduction-fold instance can perform a {\em fold operation}
to reduce the two intermediate values together, thereby
reducing the amount of information that must be stored
in the reduction instance\footnote{A fold operation
has the type $RHS \rightarrow RHS \rightarrow RHS$, reducing
two elements of the $RHS$ type to one. Fold operations
are required to be associative and commutative.}.
Reduction-list instances work better for tasks
that are performing sparse reductions (and are required
when a specialized fold operator is not defined for the
reduction), while reduction-fold instances work best
for tasks performing dense reductions. Since the choice of reduction instance
can have an impact on performance, the decision of the
type of reduction instance to use is made by the
mapper as part of the {\tt map\_task} mapper call.
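To make the fold operation concrete, the following sketch shows a
reduction operator in the style described above, with a compact RHS
type, together with a reduction-fold instance that folds repeated
reductions to the same row before they are eventually applied to a
target instance. The names and the exact operator interface are
illustrative; they are not the low-level runtime's actual
declarations.
\begin{verbatim}
#include <cstddef>
#include <vector>

// A reduction operator with a compact right-hand-side (RHS) type.
// The fold operation combines two pending RHS values and must be
// associative and commutative; apply folds an RHS into the field.
struct WeightedSum {
  typedef double LHS;   // type of the field being reduced into
  typedef float  RHS;   // smaller buffered value per row

  static void apply(LHS &lhs, RHS rhs) { lhs += rhs; }
  static void fold(RHS &accum, RHS rhs) { accum += rhs; }
};

// Sketch of a reduction-fold instance: one folded RHS entry per row.
struct FoldInstance {
  std::vector<WeightedSum::RHS> entries;
  explicit FoldInstance(std::size_t rows) : entries(rows, 0.0f) {}
  void reduce(std::size_t row, WeightedSum::RHS value) {
    WeightedSum::fold(entries[row], value);   // fold repeated reductions
  }
  void flush_to(std::vector<WeightedSum::LHS> &target) const {
    for (std::size_t row = 0; row < entries.size(); row++)
      WeightedSum::apply(target[row], entries[row]);
  }
};
\end{verbatim}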
For all region requirements that request reduction
privileges, the runtime currently requires that a
reduction instance be made. This restriction is what
permits tasks requesting reduction-only privileges
on otherwise interfering regions to both map and
execute in parallel. To track reduction instances,
each of the physical state objects for tracking
physical traversal meta-data (described in
Section~\ref{subsec:physstate}), is updated to
track all the fields that currently have outstanding
reductions, as well as a collection of reduction
instances available for the current region tree node.
When close operations are performed, reductions are
always flushed back up to the target physical instance
after all other dirty data has been copied back. This
implementation stems from a very important invariant
maintained by the logical region tree dependence
analysis: whenever sub-trees are open in reduction mode,
there can never be any dirty data below the reduction
operations. If dirty data does exist in a sub-tree before
a reduction is mapped, a close operation is always performed
before the tree is opened for reduction operations.