This is an implementation of a greedy bottom-up local register allocator for
the intel 8086 CPU. It may be used as part of a simple compiler. In particular,
it can be used in a C compiler whose int
is 16-bit and char
is 8-bit.
To make things interesting and interactive it comes with a code generator capable of generating 8086 assembly code (compilable into DOS .COM programs) from trees representing expressions with 16-bit integer ALU operations, constants and memory loads and stores. A few extra instructions generated by the code check the results of expression evaluation in the output register and the output memory locations, if any.
Thus you can actually execute the generated code and tweak things around to see how your changes affect code generation.
Sample assembly output from the register allocator/code generator:
; 7 (vr4)
; mul (vr5)
; 5 (vr3)
; add (vr6)
; 3 (vr1)
; mul (vr2)
; 2 (vr0)
; ----
; Regs needed (approximately): 3
; --------
; ; vr0
mov ax, 2 ; vr0
mov cx, 3 ; vr1
mul cx ; vr2
mov cx, 5 ; vr3
mov dx, 7 ; vr4
xchg ax, cx ; vr5
mul dx ; vr5
add cx, ax ; vr6
; ; vr6
cmp cx, 41 ; vr6
jne failure ; vr6
N.B. This register allocator will not allocate registers perfectly in all circumstances. It will occasionally generate unnecessary register moves and exchanges. Bear in mind, optimal register allocation is an NP-complete problem.
The intel 8086 architecture has a number of instructions with fixed input and/or output registers and it's not entirely trivial to connect the ins and outs of these instructions. There's obviously a long history of implementing compilers for this and similar architectures, but it's a bit odd that it's hard to find a conceptually simple algorithm like this. The "Dragon" book and many other sources either consider architectures without instructions with such fixed register operands or solve the problem using graph coloring, which for certain reasons may be impractical.
The intel 8086 architecture survives to this day in the modern intel Pentium CPUs that are compatible descendants of the old 8086 and with it survive some of the instructions with fixed in/out register operands. So, the problem persists to this day, although there are fewer limitations to deal with when using newer 32-bit and 64-bit x86 instructions and registers today.
8086 registers by special uses in expression evaluation (pardon the mixture of C and assembly operators, instructions and syntax and ASCII art):
+-- can be freely used in most ALU instructions as src/dst
| +-- can be used as address to access memory
| | +-- has individual 8-bit subregisters
| | | +-- can be sign-extended with cbw
| | | | +-- can be shifted
| | | | | +-- can be shift count
v v v v v v
+-<&|^~ [r] 8bit cbw <<dst <<cnt *dst *src /dvd /dvsr /quo /rem
ax + + + + fix+ fix+ fix+ fix+
bx + + + + + +
cx + + fix+ + +
dx + + + + fix+
si + + + + +
di + + + + +
^ ^ ^ ^ ^ ^
can be product --+ | | | | |
can be multiplicand --+ | | | |
can be dividend --+ | | |
can be divisor --+ | |
can be quotient --+ |
can be remainder --+
Outside of the scope of this work are:
- data types other than 8-bit bytes and 16-bit words (e.g. floats, structs)
- conditional/ternary operators like in C/C++
- function calls and calling conventions (however, a simple implementation is included)
- use of immediate and memory operands directly in ALU instructions (we use separate load/store instructions instead)
These are left as an exercise to the reader.
Before we get to the 8086 specifics, let's describe the algorithm for an architecture, where there aren't instructions with fixed register operands. We'll be building on top of it.
Let's also consider a case, where our expressions consist of only binary
operators (unary operators will be a trivial specialization), which are
realized as 3-register instructions in the CPU, that is, 2 input registers and
1 output register, much like in the MIPS architecture (e.g. sub r2, r3, r4
will do r2 = r3 - r4;
).
Suppose we have an expression tree like this:
op
-
op op
+ |
op op op op
^ * / %
nm nm nm nm nm nm nm nm
1 2 3 4 5 6 7 8
op
and nm
represent operator and numeric nodes in the tree.
Our hardware register allocation will be done recursively from the root towards the leaves:
AllocHRegs(node):
recursively call AllocHRegs() for both input/child nodes
Ensure() that both input/child values are in registers
Free() both input registers
Allocate() one output register
generate an instruction using these 3 register operands
For the numeric leaf node there will only be the last two steps: Allocate() for the output register and the generation of an instruction to load an integer constant into that register.
The utility functions will be:
Ensure(node):
if node's value has no location yet:
Allocate() a register for it and return the register
else if node's value is already in a register:
return that register
else: /* the node's value must have been spilled onto the stack */
Allocate() a register
generate a pop instruction to pop a value from
the stack into this register
return this register
and
Allocate(node):
if there's a vacant register among the N registers the allocator manages:
mark it as holding the node
mark the node as being in this register
return the register
else:
from the N registers managed by the allocator find the register whose
value won't be needed the longest, that is, whose parent/using node
is the most distant
mark the node that that register holds as spilled
generate a push instruction to push the node's value from the
register onto the stack
mark the register as holding the argument node passed into Allocate()
mark the node as being in this register
return the register
and
Free(hardware_reg):
mark the node this register holds as not having a location
mark the register as holding no node
Using this algorithm we will arrive at this code generated for the above tree (let's replicate the tree here):
op
-
op op
+ |
op op op op
^ * / %
nm nm nm nm nm nm nm nm
1 2 3 4 5 6 7 8
mov hr0, 1
mov hr1, 2
xor hr0, hr0, hr1
mov hr1, 3
mov hr2, 4
mul hr1, hr1, hr2
add hr0, hr0, hr1
mov hr1, 5
mov hr2, 6
idiv hr1, hr1, hr2
mov hr2, 7
mov hr3, 8
irem hr2, hr2, hr3
or hr1, hr1, hr2
sub hr0, hr0, hr1
This code needs 4 registers (hr0 through hr3).
If we restrict the number of registers that the allocator can manage to 3, we'll get this instead (let's replicate the tree one more time):
op
-
op op
+ |
op op op op
^ * / %
nm nm nm nm nm nm nm nm
1 2 3 4 5 6 7 8
mov hr0, 1
mov hr1, 2
xor hr0, hr0, hr1
mov hr1, 3
mov hr2, 4
mul hr1, hr1, hr2
add hr0, hr0, hr1
mov hr1, 5
mov hr2, 6
idiv hr1, hr1, hr2
mov hr2, 7
push hr0
mov hr0, 8
irem hr0, hr2, hr0
or hr0, hr1, hr0
pop hr1
sub hr0, hr1, hr0
Note the appearance of the push and pop instructions in the above code and the disappearance of register hr3.
The allocator runs out of registers once it has loaded 7 into hr2 and is about to load 8 into another register. At this point there are 3 partial results occupying all 3 registers:
hr0 = result of +
hr1 = result of /
hr2 = 7
We spill hr0, the result of +, onto the stack to make hr0 available for loading of 8. We choose hr0 to spill because its parent, -, is the most distant (| and %, the parents of / and 7, are closer). When we finally need the result of +, we pop it from the stack.
The relative distance of parent nodes can be obtained by simple recursive assignment of unique numbers to all nodes:
op
-
op op
+ |
op op op op
^ * / %
nm nm nm nm nm nm nm nm
1 2 3 4 5 6 7 8
0 2 1 6 3 5 4 14 7 9 8 13 10 12 11 <- node number / virtual reg
Each node then will have this unique number (we may refer to it as the virtual register) and we can either follow the node's parent link to find the parent's unique number or we may simply store the parent's unique number in the child node instead of storing there the link.
And so in this example of ours, 14 is the largest of the three numbers (14, 13 and 12), which is how we decide to kick the hr0 value out onto the stack.
Spilling the register that has the most distant use guarantees that unspilling will occur in the exact opposite order (LIFO), which lets us use the stack with push and pop instructions. Specifically, with operators being at most binary (not ternary and so on), we'll never spill two sibling nodes, that is, we'll never have to distinguish equally distant nodes.
When implementing this algorithm we'll need a global array of N pointers to the
expression tree nodes, where N is the number of the registers that the register
allocator manages. (This array is known as NodeFromHReg[]
in our code.) By
examining the contents of this array we'll know if a particular hardware
register is occupied by the result of some node. The nodes themselves will
carry the location of their results (that is, it can be one of these N hardware
registers or the stack or nowhere yet). IOW, with such a setup we'll always be
able to tell what's where, by examining the array or a node and we'll be
modifying the array and nodes appropriately to reflect the current allocations
throughout the process.
When evaluating a binary operator (e.g. a + b) we should first evaluate the child/subexpression that uses more registers, then the one that uses fewer. This leads to more effective register use and fewer spills. The logic is simple: before you can evaluate the binary operator, you need to hold the result of one of its children in a register while evaluating the other child. So, if the children need, let's say, 1 and 2 registers each to evaluate, then evaluating first then second will need 3 registers, whereas evaluating them in the opposite order will need 2 registers.
When assigning unique node numbers (virtual register numbers) as described in the preceding section, we can assign them in this order and we can then handle sibling nodes in this order by looking at and comparing their unique numbers.
See Ershov number and Sethi-Ullman algorithm.
Not all CPU architectures offer 3-register ALU instructions like the MIPS does. The x86 architecture's ALU instructions are 2-register. That is, the output overwrites one of the inputs.
This affects the previously described AllocHRegs()
function a little. It has
to be able to allocate a specific output register through the Alloc()
function such that Alloc()
would allocate the same register as one of the
inputs. So, for example, if the inputs to the add
instruction are in, say,
ax
and cx
, the output will be in ax
if we generate add ax, cx
(or the
other way around, the output will be in cx
if we generate add cx, ax
; keep
in mind that not all binary operators are symmetric).
The 8086-specific idea is:
- When an instruction at a node needs its inputs in specific registers, it
requests its child nodes to produce their outputs in desired registers
(this is done recursively through the familiar
AllocHRegs()
function). However, they may be unable to satisfy the requests due to their own limitations (that is, they can never produce the outputs elsewhere naturally) or because the desired registers already hold something useful. And so the child nodes do what they can and leave it to their parent to do the rest when they can't satisfy the request. - Thankfully, it's always possible to swap values in a few registers with
the
xchg
instruction to satisfy the needs on the input side and such swaps aren't needed too often.
So the AllocHRegs()
, Ensure()
and Allocate()
functions introduced earlier
need to receive an additional parameter, the desired register. It should be
possible to use this parameter to request a specific register, any register or
a register from a specific subset of all allocatable registers.
Currently we're using this enumeration to specify a register:
enum HReg : int
{
// Specific regs begin
HReg0, HRegAX = HReg0,
HReg1, HRegCX = HReg1,
HReg2, HRegDX = HReg2,
HReg3, HRegBX = HReg3,
HReg4, HRegSI = HReg4,
HReg5, HRegDI = HReg5,
// Specific regs end
HRegCnt,
// Constants representing multiple-choice desired regs:
HRegAny = HRegCnt, // unspecified, any reg at all is OK
HRegNotCX, // prefer regs other than cx (for shifts)
HRegNotDXNotAX, // prefer regs other than dx and ax (for (i)div)
HRegNotDXNotCXNotAX, // prefer regs other than dx, cx and ax
HRegByte, // prefer those with individual byte components: ax, cx, dx, bx
HRegByteNotCX, // prefer those with individual byte components except cx: ax, dx, bx
HRegAddr, // prefer those that can be a memory operand: bx, si, di
};
The Allocate()
function will try to allocate the requested/desired register
first, but if it fails to, it'll return a different one and the caller will
need to deal with it (e.g. use the xchg
instruction).
The 8086 shift instructions (shl
, shr
, sar
) take the shift count from the
fixed register cl
(lower half of cx
) if the count isn't a simple 1. The
value that they shift can be anywhere else.
Hence the implementation of AllocHRegs()
for shift instructions should pass
its own desired register parameter (possibly modified to exclude cx
) to the
left child's AllocHRegs()
and Ensure()
while it should pass HRegCX
as the
desired register to the right child's AllocHRegs()
and Ensure()
. Btw, when
both children need the same number of registers to compute we should prefer to
handle the right/count child first, to have higher chances of allocating cx
for the count.
desired register may pass modified to left child
|
v
shift dst, cl
| |
| v
v desired reg = cx
desired reg excludes cx
Ensure()
may fail to allocate and return HRegCX
for the right child. When
this happens, we either move the right child node from that other register to
cx
(if cx
is vacant) or we exchange the registers between that node and the
one currently residing in cx
(when cx
isn't vacant). For this we update the
node(s) and the global array of pointers to nodes (NodeFromHReg[]
) and
generate the mov
or xchg
instruction. We should also remember that
Ensure()
may allocate cx
for the left child, in which case we need to
properly update the left register.
^
|
shift dst, cl
^ ^
| |
case 1: ?x cx Nothing to do
case 2: cx ?x Swap left and right
case 3: ?x ?x Swap right and cx (or move right to cx)
The (i)mul
instruction multiplies ax
by another input and outputs the
product to a pair of registers, dx
:ax
(most significant bits:least
significant bits). That is, the product is twice as wide as either multiplicand.
However, for our purposes, for 16-bit multiplicands whose product isn't
expected to need more than 16 bits, we can treat dx
as a clobbered register.
(i)mul
therefore disregards its caller's desired register since (i)mul
always outputs to ax
. (i)mul
, being symmetric, also asks both its
child/input nodes to produce their results in ax
in the hopes that at least
one would end up in ax
.
desired reg is ignored because output is in ax
|
v
(i)mul ax, src
| |
| v
v desired reg = ax
desired reg = ax
If either of (i)mul
's inputs ends up in dx
, it's kept there, since it can
be conveniently clobbered. However, if none of the inputs is in dx
, dx
may
need to be explicitly preserved (if dx
holds some other partial result that
we don't want (i)mul
to clobber).
To preserve dx
we exchange the registers between one of the input nodes and
the node in dx
. IOW, we reduce this to having one of the inputs in dx
,
which is safe to clobber.
So, through at most 2 register exchanges we can make one input in ax
and the
other, if needed, in dx
.
N.B. the 16 least significant bits of a product of two 16-bit integers are the
same for both signed and unsigned multiplication and so on the x86 we can use
mul
and imul
interchangeably if we're interested in just those 16 bits of
the product.
Computing quotients and remainders with (i)div
is overall trickier.
The instruction divides a pair of registers, dx
:ax
(most significant bits:
least significant bits), that is, a 32-bit integer, by a 16-bit input and the
outputs are in dx
(the remainder) and ax
(the quotient). When we need to
divide 16-bit integers we zero- or sign-extend the dividend from 16 bits to 32
bits, filling dx
with the extension. dx
being part of the dividend naturally
prevents us from using dx
for the divisor.
But first, we need the dividend in ax
, for which we pass HRegAX
as the
desired register into AllocHRegs()
and Ensure()
for the left input node. We
use HRegNotDXNotAX
(meaning any register but dx
or ax
) as the desired
register for the right input node.
desired reg is ignored because output is in ax or dx
|
v
(i)div dx:ax, src
| |
| v
v desired reg excludes dx and ax
desired reg = ax
Then through at most one xchg
we can make sure that the dividend is indeed in
ax
.
If we're unlucky and Ensure()
(or the xchcg
) gives us dx
for the divisor
or dx
contains some other partial result, we Allocate()
another register
(and immediately Free()
it, IOW, we just find an available register, spilling
if necessary) and move to it whatever node is in dx
, thereby freeing dx
from
anything useful.
Finally, when we get to allocate the output register for (i)div
, we allocate
ax
if we're interested in the quotient or we allocate dx
if we're interested
in the remainder.
N.B. when computing how many registers an (i)div
-based division/remainder
node needs, we factor in the above restriction around dx
. This raises
the minimum from 2 to 3 registers, but otherwise the computation remains the
same as for e.g. add
. The order of evaluation and the node numbering described
earlier should take this into account.
To load a value from memory we need the memory address in one of 3 suitable
registers: bx
, si
, di
. So, the child/input node supplying the address will
get its AllocHRegs()
and Ensure()
called with HRegAddr
as the desired
register. If we're unlucky to get the address in one of those registers, we can
use xchg
, as usual, to bring the address into e.g. bx
.
A 16-bit value can be loaded into any 16-bit register, but an 8-bit value can
only be loaded into individual 8-bit subregisters, e.g. al
, bl
, cl
, dl
(the desired register can be set to HRegByte
when we'd like a register with
8-bit subregisters). Further, if we intend to sign-extend an 8-bit value from
memory to 16 bits with the cbw
instruction, we need to load the value into
al
.
On top of that we want to load the value into the desired register if we can do
it directly. So, before we call Allocate()
to allocate the output register
we're going to examine the desired register value and see if it's available and
suitable. Again, if Allocate()
doesn't give us a register with 8-bit
subregisters when we need one, we may use xchg
to get us ax
/al
for the
output register.
desired reg may be ignored and
chosen from regs that have 8-bit subregs
|
v
mov r, [r]
|
v
desired reg can address memory
Stores are a bit simpler than loads, but they have the same fundamental
requirements of the memory address being in one of the 3 suitable registers
(bx
, si
, di
) and the 8-bit value being in one of the 4 suitable registers
(ax
/al
, bx
/bl
, cx
/cl
, dx
/dl
) and we may need to mov
or xchg
to satisfy these requirements.
desired reg may be ignored
|
v
mov [r], r
| |
| v
v desired reg may need to have 8-bit subregs
desired reg can address memory
N.B. the store node's result (besides the direct effect of writing to memory)
can be the value that's being stored similar to how we can write a = b = c;
in C/C++ to use the result of one assignment (b = c
) as a value for another
assignment (a = ...
).
- See Ershov number, Sethi-Ullman algorithm.
- Engineering a Compiler 2nd ed. by Keith D. Cooper and Linda Torczon. In particular, section 13.3.2 "Bottom-Up Local Register Allocation".
- Compiler Construction by William M. Waite and Gerhard Goos. In particular, section 10.2.2 "Targeting".
- Compilers Principles, Techniques, and Tools (AKA the "Dragon" book) by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman.