forked from bytecodealliance/regalloc.rs
-
Notifications
You must be signed in to change notification settings - Fork 1
/
TODO.txt
166 lines (119 loc) · 7.66 KB
/
TODO.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
=== TODOs. Last updated 17 Mar 2020. =====================================
=== Tidying ===============================================================
- (should do for MVP): fn CFGInfo::create::dfs: use an explicit stack instead
of recursion. As it stands, we could get a stack overflow if given a fn
with several thousand BBs, all in a chain.
=== Interface =============================================================
- Important: have a way for clients to specify spill slot relative ordering and
alignment issues. (This is actually required for correctness).
https://github.com/bytecodealliance/regalloc.rs/pull/35 has the current
proposal.
- Require clients to specify the number of virtual regs in the input, so that
we can use DenseSet<VirtualReg>.
- Allow clients to specify an is_leaf_function flag, so we can switch
allocation ordering to favour leaf functions in such cases.
=== BT allocation quality =================================================
- Implement stack slot move coalescing. This is obviously going to be
important for inputs with very large numbers of simultaneously live values.
=== Allocator run time (analysis phase, algorithmic) ======================
Estimated most-important-first:
- (Longer term) merge_RangeFrags: can we do even better?
- (Longer term) Maybe remove dominance computation and the existing loop
finder completely, and instead use a standalone loop-finder algorithm, eg
https://lenx.100871.net/papers/loop-SAS.pdf, as suggested by Chris.
=== Allocator run time (both BT and LSRA, algorithmic) ====================
- Remove scan at end of allocation that figures out which real regs got used.
Instead harvest that information from the individual algorithm's rreg-state
tables at the end of allocation. Doing this will remove a pass over all
instructions so it seems like a significant win.
- Inst rewrite loop: move cursors forwards at Point granularity so we don't
have to repeatedly re-scan the groups looking for particular LR kinds?
=== Allocator run time (LSRA, algorithmic) ================================
=== Allocator run time (BT, algorithmic) ==================================
- Inst rewrite loop: don't clone map_defs; just use it as soon as it's
available.
- Core allocation loop: see if we can reduce the amount of wasted effort
checking to see whether a VirtualRange can be accommodated by a register, by
reducing the number of registers we have to try.
- Spill slot allocator: don't repeatedly search all slots if likelyhood of
them being able to take a new group of VirtualRanges is low.
=== Allocator run time (representational) =================================
- Collect typical use data for each Set<T> instance and replace with a
suitable optimised replacement. Maybe DenseSet<T: ToFromU32> and
SparseSet<T: ToFromU32>.
- Iterative liveness analysis: for DenseSet<T>, considering implementing a
special (X `minus` Y) `union` Z operation, so as to avoid having to create
intermediate (X `minus` Y) sets.
- Ditto FxHashMap (if we have to have it at all)
- Replace SortedFragIxs with something more efficient. Does it even need to
be sorted? Maybe it should be changed to a SparseSet<RangeFragIx>.
- Profile with DHAT and selectively change Vec to SmallVec as needed. Do not
change Vec to SmallVec without supporting DHAT profile data; this can make
things run more slowly.
- (Longer term) investigate the feasibility of removing the fragment
environment and instead storing fragment range data directly in what is
currently RangeFragIx.
=== Functionality =====================================================
Post-MVP:
- BT: Live Range Splitting (but see next point)
- BT: rewrite the main live-range allocation loop, so as
to make it perform copy coalescing more robustly.
Currently the loop processes one VirtualRange at a time, and makes a
best-effort to honour the coalescing hints available. However, this is all
rather ad hoc. It would be better to change the structure so that the basic
unit of allocation is a *group* of VirtualRanges, where VirtualRanges are
put into the same group if they are connected by V-V copies. Hence:
* coalescing analysis computes:
- for each VLR, a weighted set of register prefs, based on V-R and R-V
copies only
- the grouping of VLRs, based on V-V copies
* the allocation loop then allocates/evicts entire groups at a time.
Allocating an entire group at once guarantees that all V-V copies between
VLRs in the group will disappear, since by definition all group members
have the same real register.
* when allocating a group, draw up a weighted list of preferred candidates
by checking the V-R/R-V derived prefs of all its members, and try first to
use those (as the present logic does).
* if a group can't be allocated, and contains more than one VLR, then it is
split up into multiple groups, each containing one VLR, and these are put
back into the allocation queue. Note that this doesn't have to happen in
a single step; it would be ok to gradually split the groups into smaller
groups, with the proviso that the eventual end-point of the process must
result in each group containing only a single VLR.
The result of splitting up a group is that we can no longer guarantee to
remove V-V copies between the group's VLRs, but that's unavoidable.
* if a group can't be allocated, and contains only a single VLR, then either
- it must be spilled, as at present
- it must be split into multiple smaller VLRs, each of which is put into
its own group. (Not yet implemented).
This design has some nice properties:
* It maximally coalesces virtual groups to the extent it can.
* Imagine the situation where a group has a real-reg preference, as a result
of one of its members (VLRs) having that preference. Suppose further that
the group cannot be allocated, and must be split, so it is split into two
smaller groups. The group that does *not* inherit the preferred register
is now unconstrained by the preference, and the group that does inherit
the preference is smaller. Both of these effects potentially remove
interferences and so increase the chances of the two smaller groups being
successfully allocated.
* It integrates nicely with the idea of splitting. If a single-element
group (a VLR) cannot be allocated, we could choose to split it into
multiple single-element groups. The only difficulty is to decide how to
place the splits in such a way that it is unnecessary to add explicit
copies to the instruction stream (since that sounds complex and expensive)
and yet is profitable.
The no-changes-to-the-insn stream idea may be a tradeoff that is worth
persuing in an allocator intended for use in a JIT. It reduces the
splitter's possibilities but means it doesn't have to deal with the
complexity of adding copies at the dominance frontiers, etc. If done
carefully it could mean we never need to change the instruction stream at
all. The restriction basically has the effect that any split point must
divide the blocks in the VLR into two disjoint groups, those dominated by
the "split point" and those postdominated by the "split point", so that in
effect all "traffic" in the VLR flows through the split point.
Another opportunistic-split possibility is to add two splits at the start
and end of a single basic block that has high register pressure, perhaps
due to a call. Then we wouldn't even have to bother checking the
dominates/postdominates condition.
These are complex tradeoffs to evaluate. Perhaps in 2H2020.
=== end ===================================================================