Tiling virtual memory for TLBs and Caches
Tiling is a useful technique that can improve memory system performance with respect to TLBs and caches.
Performing tiled address transformations in hardware is possible. This page discusses issues related to doing so.
Consider a large 2D array, allocated in memory in row-major order. In such an array, consecutive memory addresses correspond to elements of the same row of the array, except where one row ends and the next begins. (Array dimensions are often padded so that a row boundary does not fall within a cache line, and sometimes so that it does not fall within a (small, 4K) page of virtual memory.)
However, much software accesses "nearest neighbors". E.g. a computation on A[i,j] may access A[i-1,j], A[i+1,j], A[i,j-1], and A[i,j+1]. Accessing an N by N neighborhood of a standard, untiled array involves touching at least N different cache lines, and N different pages if the row size in bytes exceeds the page size - multiplied by the number of cache line or page granules spanned by each row of the neighborhood.
For very large neighborhoods, such as those found in some image processing codes, the untiled neighborhood by itself may be comparable to the size of the cache or TLB.
In other situations the neighborhood by itself may fit, but software that traverses an entire row at a time may thrash the cache, whereas a tiled loop traversal may not. Loop optimizations such as cache blocking can improve this situation, but are not always possible.
There have been software libraries that allocate such arrays in a tiled manner: instead of allocating in row-major order, they change the mapping of array elements to memory locations, thereby minimizing the number of cache lines or TLB entries covering a given 2D neighborhood. But this incurs software overhead, and AFAIK is not supported by any common programming language standard.
The hardware address transformations described here perform the necessary address transformations that such software systems provide - thereby reducing overhead.
TBD: Obvious.
Tiling is common, e.g. in GPUs and graphics framebuffers. It is usually implemented as special hardware that transforms a pre-tiled physical address to a post-tiled physical address - but only for certain physical address ranges, and is therefore not generally usable.
Tiling can be made more generally useful by performing it on virtual addresses, for ordinary DRAM. OS support, of course, would be required.
(Or, similarly, by making the OS aware of physical memory ranges that can be tiled, so that the OS can map virtual address pages to it. But since such physical address ranges are restricted, both in memory size and tiling parameters, this is less than generally useful.)
To tile an address, whether physical or virtual, one typically swaps two bitfields of the address.
Franklin92 illustrates this as below:
Untiled virtual address | | |
---|---|---|---
Yaddr | | Xaddr |
Yupper | Ylower | Xupper | Xlower

Tiled virtual address | | |
---|---|---|---
Tile Number | | Tile Offset |
Yupper | Xupper | Ylower | Xlower
I extend this as follows:
Untiled virtual address | | | | | |
---|---|---|---|---|---|---
Base | Yaddr | | ... | Xaddr | | ...
 | Yupper | Ylower | | Xupper | Xlower |

Tiled virtual address | | | | | |
---|---|---|---|---|---|---
Base | Tile Number | | ... | Tile Offset | | ...
 | Yupper | Xupper | | Ylower | Xlower |
The bitfields swapped are selected according to the dimensions of the tile desired.
Swapping bitfields around the page boundary creates tiled virtual pages in the TLB.
Swapping bitfields around the cache line boundary tiles cache lines.
Hardware bitfield swapping within addresses (virtual or physical) is straightforward.
It is similar to the hardware address transformation required for pointer masking. E.g. as has already been proposed for RISC-V.
However, any such address transformation worsens critical paths.
TBD: discuss which transformations can be implemented where in the pipeline - e.g., before access to a PIPT cache, after, etc.
As mentioned above, tiling of physical addresses is quite common. One or more sets of such parameters are stored with the tiling logic, and are applied after determination that an address is within the tiled physical address range.
To perform tiling on virtual addresses, it is natural to store the tiling parameters in entries within the page tables that map virtual addresses to physical addresses.
For a 64-bit address width, the tiling parameters amount to roughly four six-bit values - a position and a width for each of the two swapped bitfields - i.e. about 24 bits. Restrictions on the positions and widths of the tiles reduce the storage required. Nevertheless, 24 or slightly fewer bits is quite expensive to provide in a page table entry that is typically only the virtual address width.
It is desirable to keep the (small 4K) leaf page table entries at 64 bits.
But it may be less onerous to have larger page table entries for intermediate nodes, such as the next-level page directory entries, which on many systems each map 512 PTEs. This constrains the granularity of such data structures, but that is probably acceptable if they are large arrays.
Similarly, some systems create non-radix page sizes using groups of adjacent small/leaf PTEs, e.g. the NAPOT pages proposed for RISC-V. The NAPOT encoding frees up bits that might be used for advanced purposes, such as tiling parameters. But there are many customers for such bits, and there are not very many of them, so I consider it unlikely that these bits can be used for tiling parameters.
The number of bits required for tiling parameters in the page table entries can be reduced by a level of indirection. E.g. a 4 bit field pointing to a table of 16 tiling parameter entries.
Note that the tiling parameters, i.e. the locations of the bitfields to swap, must be stored both somewhere in the page tables and also in TLB entries.
- [Franklin92] James Franklin, Kodak: "Tiled Virtual Memory for Linux", Summer '92 USENIX, pp. 99-, https://www.usenix.org/legacy/publications/library/proceedings/sa92/franklin.pdf