TODO.txt


=== TODOs.  Last updated 17 Mar 2020. =====================================

=== Tidying ===============================================================

- (should do for MVP): fn CFGInfo::create::dfs: use an explicit stack instead
  of recursion.  As it stands, we could get a stack overflow if given a fn
  with several thousand BBs, all in a chain.

=== Interface =============================================================

- Important: have a way for clients to specify spill slot relative ordering and
  alignment issues.  (This is actually required for correctness).
  https://github.com/bytecodealliance/regalloc.rs/pull/35 has the current
  proposal.

- Require clients to specify the number of virtual regs in the input, so that
  we can use DenseSet<VirtualReg>.

- Allow clients to specify an is_leaf_function flag, so we can switch
  allocation ordering to favour leaf functions in such cases.

=== BT allocation quality =================================================

- Implement stack slot move coalescing.  This is obviously going to be
  important for inputs with very large numbers of simultaneously live values.

=== Allocator run time (analysis phase, algorithmic) ======================

Estimated most-important-first:

- (Longer term) merge_RangeFrags: can we do even better?

- (Longer term) Maybe remove dominance computation and the existing loop
  finder completely, and instead use a standalone loop-finder algorithm, eg
  https://lenx.100871.net/papers/loop-SAS.pdf, as suggested by Chris.

=== Allocator run time (both BT and LSRA, algorithmic) ====================

- Remove scan at end of allocation that figures out which real regs got used.
  Instead harvest that information from the individual algorithm's rreg-state
  tables at the end of allocation.  Doing this will remove a pass over all
  instructions so it seems like a significant win.

- Inst rewrite loop: move cursors forwards at Point granularity so we don't
  have to repeatedly re-scan the groups looking for particular LR kinds?

=== Allocator run time (LSRA, algorithmic) ================================

=== Allocator run time (BT, algorithmic) ==================================

- Inst rewrite loop: don't clone map_defs; just use it as soon as it's
  available.

- Core allocation loop: see if we can reduce the amount of wasted effort
  checking to see whether a VirtualRange can be accommodated by a register, by
  reducing the number of registers we have to try.

- Spill slot allocator: don't repeatedly search all slots if likelyhood of
  them being able to take a new group of VirtualRanges is low.

=== Allocator run time (representational) =================================

- Collect typical use data for each Set<T> instance and replace with a
  suitable optimised replacement.  Maybe DenseSet<T: ToFromU32> and
  SparseSet<T: ToFromU32>.

- Iterative liveness analysis: for DenseSet<T>, considering implementing a
  special (X `minus` Y) `union` Z operation, so as to avoid having to create
  intermediate (X `minus` Y) sets.

- Ditto FxHashMap (if we have to have it at all)

- Replace SortedFragIxs with something more efficient.  Does it even need to
  be sorted?  Maybe it should be changed to a SparseSet<RangeFragIx>.

- Profile with DHAT and selectively change Vec to SmallVec as needed.  Do not
  change Vec to SmallVec without supporting DHAT profile data; this can make
  things run more slowly.

- (Longer term) investigate the feasibility of removing the fragment
  environment and instead storing fragment range data directly in what is
  currently RangeFragIx.

=== Functionality =====================================================

Post-MVP:

- BT: Live Range Splitting (but see next point)

- BT: rewrite the main live-range allocation loop, so as
  to make it perform copy coalescing more robustly.

  Currently the loop processes one VirtualRange at a time, and makes a
  best-effort to honour the coalescing hints available.  However, this is all
  rather ad hoc.  It would be better to change the structure so that the basic
  unit of allocation is a *group* of VirtualRanges, where VirtualRanges are
  put into the same group if they are connected by V-V copies.  Hence:

  * coalescing analysis computes:
    - for each VLR, a weighted set of register prefs, based on V-R and R-V
      copies only
    - the grouping of VLRs, based on V-V copies

  * the allocation loop then allocates/evicts entire groups at a time.
    Allocating an entire group at once guarantees that all V-V copies between
    VLRs in the group will disappear, since by definition all group members
    have the same real register.

  * when allocating a group, draw up a weighted list of preferred candidates
    by checking the V-R/R-V derived prefs of all its members, and try first to
    use those (as the present logic does).

  * if a group can't be allocated, and contains more than one VLR, then it is
    split up into multiple groups, each containing one VLR, and these are put
    back into the allocation queue.  Note that this doesn't have to happen in
    a single step; it would be ok to gradually split the groups into smaller
    groups, with the proviso that the eventual end-point of the process must
    result in each group containing only a single VLR.

    The result of splitting up a group is that we can no longer guarantee to
    remove V-V copies between the group's VLRs, but that's unavoidable.

  * if a group can't be allocated, and contains only a single VLR, then either
    - it must be spilled, as at present
    - it must be split into multiple smaller VLRs, each of which is put into
      its own group. (Not yet implemented).

  This design has some nice properties:

  * It maximally coalesces virtual groups to the extent it can.

  * Imagine the situation where a group has a real-reg preference, as a result
    of one of its members (VLRs) having that preference.  Suppose further that
    the group cannot be allocated, and must be split, so it is split into two
    smaller groups.  The group that does *not* inherit the preferred register
    is now unconstrained by the preference, and the group that does inherit
    the preference is smaller.  Both of these effects potentially remove
    interferences and so increase the chances of the two smaller groups being
    successfully allocated.

  * It integrates nicely with the idea of splitting.  If a single-element
    group (a VLR) cannot be allocated, we could choose to split it into
    multiple single-element groups.  The only difficulty is to decide how to
    place the splits in such a way that it is unnecessary to add explicit
    copies to the instruction stream (since that sounds complex and expensive)
    and yet is profitable.

    The no-changes-to-the-insn stream idea may be a tradeoff that is worth
    persuing in an allocator intended for use in a JIT.  It reduces the
    splitter's possibilities but means it doesn't have to deal with the
    complexity of adding copies at the dominance frontiers, etc.  If done
    carefully it could mean we never need to change the instruction stream at
    all.  The restriction basically has the effect that any split point must
    divide the blocks in the VLR into two disjoint groups, those dominated by
    the "split point" and those postdominated by the "split point", so that in
    effect all "traffic" in the VLR flows through the split point.

    Another opportunistic-split possibility is to add two splits at the start
    and end of a single basic block that has high register pressure, perhaps
    due to a call.  Then we wouldn't even have to bother checking the
    dominates/postdominates condition.

    These are complex tradeoffs to evaluate.  Perhaps in 2H2020.

=== end ===================================================================