
mem2reg

LLVM IR

The compiler needs a data structure to represent code. Currently, I use a syntax tree as an intermediate representation between wend and assembly. While some optimizations are possible with this structure, there are better alternatives.

I have decided that LLVM IR is an excellent fit for my compiler. My project has no external dependencies, and I do not intend to rely on LLVM. However, having the ability to save my intermediate representation to disk and execute it using an external tool is invaluable!

Let's see what Clang does with the following file:

int max(int a, int b) {
    int result = a;
    if (b > a) {
        result = b;
    }
    return result;
}

We run clang -cc1 -O0 -emit-llvm -disable-O0-optnone max.c -o max.ll and get the following file:

max.ll
define dso_local i32 @max(i32 noundef %a, i32 noundef %b) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  %result = alloca i32, align 4
  store i32 %a, ptr %a.addr, align 4
  store i32 %b, ptr %b.addr, align 4
  %0 = load i32, ptr %a.addr, align 4
  store i32 %0, ptr %result, align 4
  %1 = load i32, ptr %b.addr, align 4
  %2 = load i32, ptr %a.addr, align 4
  %cmp = icmp sgt i32 %1, %2
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  %3 = load i32, ptr %b.addr, align 4
  store i32 %3, ptr %result, align 4
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  %4 = load i32, ptr %result, align 4
  ret i32 %4
}

This is Clang's intermediate representation of our int max(int, int) function. It can be better visualized using LLVM tools. The following script renders all .ll files in the current directory:

draw.sh
#!/bin/bash

# dump the control flow graph of every .ll file in the current directory to .dot files
for ll in *.ll; do
    opt -passes=dot-cfg -disable-output -cfg-dot-filename-prefix="$(basename "$ll" .ll)" "$ll"
done

# render every .dot file to a .png image
for f in *.dot; do
    dot -Tpng "$f" -O
done

And here is our int max(int, int):

In essence, this is the same intermediate code file, but here the control flow graph is explicitly visible. LLVM intermediate code is akin to assembly code: it can be broken down into basic blocks of linear code. Each basic block ends with a br branching instruction (or a simple ret), and these branches can be visualized as edges connecting the basic blocks. There are very few instruction types, making it very similar to assembly, but for a virtual machine rather than real hardware.

Notice that Clang has generated a completely unoptimized version of the code. For all three variables a, b, and result, it allocated space on the stack using the alloca instruction, and every access to them is explicitly handled via load and store instructions. Of course, this is highly inefficient, but Clang does not care, as optimizing the code is LLVM’s responsibility, not Clang's.

TinyCompiler -> TinyOptimizer

Well, that works fine — tinycompiler generates basic assembly-like code. Since LLVM IR is not much different from assembly, I can simply switch the target language templates from GNU assembly to LLVM IR. And since LLVM allows direct execution of its intermediate language files, I will temporarily discard assembly output and reintroduce it only after finishing my optimization experiments.

I have no idea where this path will lead me, and I want to keep tinycompiler minimalistic. So, I forked it into tinyoptimizer. By the way, why on earth doesn’t GitHub allow users to fork their own repositories? I had to create a clone instead of a proper fork.

Here is the commit where I switched the language template. Along the way, I made a small change: since assembly lacks variables, I previously accessed them via stack offsets. However, LLVM IR supports proper identifiers and parameter passing (as seen in the earlier example), so why not take advantage of them? This leaves the question of accessing variables declared outside a function. In tinycompiler, I handled this using displays, but now I simply added pointers to external variables as function parameters, and that’s it.

Let me provide an example. Among my test programs, there is this one:

sopfr.wend
main() {

    // Sum of prime factors (with repetition) of a number n.
    // E.g., 40 factors as 2^3 * 5 and sopfr(40) = 2 * 3 + 5 = 11.
    // It is also known as the integer logarithm.
    // https://oeis.org/A001414

    int sopfr(int n) {
        int div;
        int sopfr_aux(int n) {
            int recursion;
            recursion = 0;
            if n % div == 0 {
                recursion = div;
                if n!=div {
                    recursion = recursion + sopfr_aux(n/div);
                }
            } else {
                div = div + 1;
                recursion = sopfr_aux(n);
            }
            return recursion;
        }

        div = 2;
        return sopfr_aux(n);
    }

    println sopfr(42);
}

For this wend program, I generate LLVM IR, which would have been produced by Clang from the following C equivalent:

sopfr.c
#include <stdio.h>

int sopfr_aux(int n, int *div) {
    int recursion = 0;
    if (n % *div == 0) {
        recursion = *div;
        if (n!=*div) {
            recursion = recursion + sopfr_aux(n / *div, div);
        }
    } else {
        *div = *div + 1;
        recursion = sopfr_aux(n, div);
    }
    return recursion;
}

int sopfr(int n) {
    int div = 2;
    return sopfr_aux(n, &div);
}

void main() {
    printf("%d\n", sopfr(42));
}

I convert nested functions into regular ones and pass pointers to the necessary variables. Thus, sopfr_aux automatically receives a pointer to the div variable, which resides in the stack frame of the sopfr function. Of course, I provided the C equivalent just for readability — I generate LLVM IR directly from wend.

Today's topic: mem2reg

Finally, we have reached the main topic of this article. Today, we will explore the mem2reg pass. Let's revisit the function for finding the maximum of two numbers. Clang transforms it into the intermediate code shown on the right in the image below. Now, let's put Clang aside and ask LLVM to improve the code slightly — not fully optimize it, just perform a single explicitly specified pass that promotes memory-stored variables to registers.

We run opt -passes=mem2reg max.ll -S and obtain the transformed code shown on the left in the image. Interestingly, all alloca, load, and store instructions have been completely removed, replaced by a peculiar phi instruction. What exactly is this?

Let's take a look at how LLVM developers define it :).

LLVM intermediate code is based on the Static Single Assignment (SSA) form. The main idea of constructing SSA form is to assign unique names to all assignment results. Virtually all programs contain branches and merge nodes. At merge nodes, we need to add a special form of assignment called a \(\phi\)-function.

A \(\phi\)-function is always executed at the entry of a basic block, and it has the following form: \(x \gets \phi(x_1, x_2, \dots, x_n)\) where \(n\) is the number of predecessor nodes in the control flow graph. The semantics of \(\phi\)-functions is simple: the value of the source \(x_i\), corresponding to the block \(i\) from which the control flow came, is selected and assigned to the variable \(x\). If control flows into the block from its \(j\)-th predecessor, then \(x\) gets the value \(x_j\). Once again, all \(\phi\)-functions are executed before the regular instructions in this node.
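Since the transformed code is only shown as a picture, here is roughly what opt -passes=mem2reg max.ll -S prints for our example (attributes trimmed):

define i32 @max(i32 %a, i32 %b) {
entry:
  %cmp = icmp sgt i32 %b, %a
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  %result.0 = phi i32 [ %b, %if.then ], [ %a, %entry ]
  ret i32 %result.0
}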

So, at the entry to the if.end block, the register %result.0 gets the value %b if the control flow passes through the if.then block, and the value %a if it does not. This results in a kind of ternary conditional operator instead of intensive stack manipulation.

Why are \(\phi\)-functions necessary? Look at the provided examples and note that in SSA form, each register is assigned a value only once. Every time we write an instruction like %a = ... in SSA, it's as if in C we were only allowed to use the assignment operator for initializing constant variables like const int a = .... This does not mean that during the program execution, a will not hold different values. Each basic block can be considered a small function where we declare a local constant. By calling functions with different arguments, we get different constants a. This also means that we can work locally with the block and do not have to worry about the register being redefined somewhere down the graph. There is only one definition, and we build from it.
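As a tiny hand-written illustration (mine, not compiler output), a C fragment like int x = 1; x = x + 2; return x; could be rendered in SSA form as:

define i32 @f() {
entry:
  %x.0 = add i32 0, 1       ; the first definition of x gets its own name
  %x.1 = add i32 %x.0, 2    ; the redefinition gets a fresh name, %x.0 is never overwritten
  ret i32 %x.1
}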

Just in case, how can we represent a loop where the counter must explicitly change? Let's take a look!

I ran clang and then the mem2reg optimization pass on trivial counter code. Clang placed the variable i on the stack and counted to ten. LLVM, when moving variables from memory to SSA registers, removed alloca/load/store entirely and added a \(\phi\)-function to the while.cond block. If during program execution we enter this block from the entry block, then %i.0 gets the initial value. If we enter it from the while.body block, then %i.0 gets the value %inc. Thus, every virtual register has only one defining instruction, but its immediate value depends on the path taken to reach it.
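The picture boils down to the following code (reconstructed by hand with the register names mentioned above, so the exact instructions may differ):

define i32 @loop() {
entry:
  br label %while.cond

while.cond:                                       ; preds = %while.body, %entry
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %while.body ]
  %cmp = icmp slt i32 %i.0, 10
  br i1 %cmp, label %while.body, label %while.end

while.body:                                       ; preds = %while.cond
  %inc = add i32 %i.0, 1
  br label %while.cond

while.end:                                        ; preds = %while.cond
  ret i32 %i.0
}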

There are several ways to deduce from this graph that ret always returns 10, and that we can discard all this stuff by replacing the int loop() function call with the constant 10, but we will talk about this some other time.

In principle, these considerations can be applied to other intermediate code representations, but SSA form is currently by far the most popular. And the mem2reg optimization pass, which moves variables from memory to SSA registers, plays a key role in optimization. Of course, since no assembly language has an equivalent of \(\phi\)-functions, we will eventually have to invoke the reverse pass reg2mem, replacing \(\phi\)-functions with alloca/store/load. But between mem2reg and reg2mem there will be other optimization passes, including constant propagation and much more.

Let's write some code

I've encountered Fibonacci number calculation code so often that I've developed an aversion to it. This is one of the reasons why I spent a lot of time on wend graphical demos. However, now we will be working with intermediate code by hand, and I need it to be somewhat meaningful but extremely simple. So, let's consider a C function that calculates Fibonacci numbers:

fib.c
int fib(int n) {
    int a = 0;
    int b = 1;
    int c = 0;
    int i = 1;
    if (n == 0) return 0;
    while (i < n) {
        c = b;
        b = a + b;
        a = c;
        i = i + 1;
    }
    return b;
}

If we ask clang to generate intermediate code for it, we get this control flow graph:

Let's put my compiler aside for a while and write some independent code. What data structure can we use to represent such a control flow graph in memory? I sketched out this file:

ir.py
class Instruction:
    def __init__(self, string, *args):
        self.string, self.op = string, args # string and parameters (typically three)

    def __repr__(self):
        return self.string.format(op = self.op)

class Phi:
    def __init__(self, reg, t, choice=None):
        self.reg, self.t, self.choice = reg, t, choice or {}

    def __repr__(self):
        choice = ', '.join([ '[{percent}{value}, %{block}]'.format(percent = '%' if type(value) is str else '', value=value, block=block) for block, value in self.choice.items() ])
        return f'%{self.reg} = phi {self.t} {choice}'

class BasicBlock:
    def __init__(self, label):
        self.label = label
        self.phi_functions, self.instructions  = [], []
        self.successors, self.predecessors = set(), set()

    def find_and_replace(self, find, replace):
        for i in self.instructions:
            i.op = list( replace if s==find else s for s in i.op )

    def __repr__(self):
        return f'{self.label}:\n' + \
                ''.join([ f'\t{phi}\n' for phi in self.phi_functions ]) + \
                ''.join([ f'\t{i}\n'   for i   in self.instructions  ])

class ControlFlowGraph:                  # a control flow graph is made of basic blocks and
    def __init__(self, header, footer):  # header + footer strings to form a valid LLVM IR program
        self.header, self.footer, self.blocks, self.last = header, footer, {}, None

    def add_block(self, label):
        self.blocks[label] = BasicBlock(label)
        self.last = self.blocks[label]   # the last block added to the CFG, heavily used by the IR builder

    def __repr__(self):
        return f'{self.header}\n' + \
               ''.join([ f'{block}' for block in self.blocks.values() ]) + \
               f'{self.footer}\n'

    def find_and_replace(self, find, replace):
        for b in self.blocks.values():
            b.find_and_replace(find, replace)

    def compute_adjacency(self):
        for b1 in self.blocks.values():
            if any('unreachable' in i.string or 'ret ' in i.string for i in b1.instructions):
                continue # even if there is a br instruction later on in the block, there are no outgoing edges
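            # for 'br i1 %cond, label %a, label %b' skip op[0] (the condition register); for 'br label %a' take all operands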
            for succ in b1.instructions[-1].op[int('br i1' in b1.instructions[-1].string):]:
                b2 = self.blocks[succ]
                b1.successors.add(b2)
                b2.predecessors.add(b1)

These 55 lines allow me to manipulate intermediate code. I have a class ControlFlowGraph, which is simply a list of objects BasicBlock. Each basic block is just a list of \(\phi\)-functions and regular instructions, as well as a list of links to the predecessors and successors of this block. Predecessors and successors are initially empty but are filled (compute_adjacency()) based on branching instructions. Each instruction is a fairly arbitrary string and a list of constants and register names involved in it.
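To make the template convention concrete, here is a tiny usage example (mine, not part of the article's files):

from ir import Instruction, Phi

i = Instruction('%{op[0]} = add i32 %{op[1]}, {op[2]}', 'sum', 'x', 1)
print(i)   # %sum = add i32 %x, 1

p = Phi('i_while.cond', 'i32', {'entry': 1, 'while.body': 'inc'})
print(p)   # %i_while.cond = phi i32 [1, %entry], [%inc, %while.body]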

I took the intermediate code of the int fib(int) function generated by clang and manually entered it into my program:

fib.py
from ir import *
cfg = ControlFlowGraph('define i32 @fib(i32 %n) {', '}')
cfg.add_block('entry')
cfg.last.instructions = [
    Instruction('%{op[0]} = alloca i32', 'retval'),
    Instruction('%{op[0]} = alloca i32', 'n.addr'),
    Instruction('%{op[0]} = alloca i32', 'a'),
    Instruction('%{op[0]} = alloca i32', 'b'),
    Instruction('%{op[0]} = alloca i32', 'i'),
    Instruction('%{op[0]} = alloca i32', 'c'),
    Instruction('store i32 %{op[0]}, ptr %{op[1]}', 'n', 'n.addr'),
    Instruction('store i32 {op[0]}, ptr %{op[1]}', 0, 'a'),
    Instruction('store i32 {op[0]}, ptr %{op[1]}', 1, 'b'),
    Instruction('store i32 {op[0]}, ptr %{op[1]}', 1, 'i'),
    Instruction('store i32 {op[0]}, ptr %{op[1]}', 0, 'c'),
    Instruction('%{op[0]} = load i32, ptr %{op[1]}', '0', 'n.addr'),
    Instruction('%{op[0]} = icmp eq i32 %{op[1]}, {op[2]}', 'cmp', '0', 0),
    Instruction('br i1 %{op[0]}, label %{op[1]}, label %{op[2]}', 'cmp', 'if.then', 'if.end')
]

cfg.add_block('if.then')
cfg.last.instructions = [
    Instruction('store i32 {op[0]}, ptr %{op[1]}', 0, 'retval'),
    Instruction('br label %{op[0]}', 'return')
]

cfg.add_block('if.end')
cfg.last.instructions = [
    Instruction('br label %{op[0]}', 'while.cond')
]

cfg.add_block('while.cond')
cfg.last.instructions = [
    Instruction('%{op[0]} = load i32, ptr %{op[1]}', '1', 'i'),
    Instruction('%{op[0]} = load i32, ptr %{op[1]}', '2', 'n.addr'),
    Instruction('%{op[0]} = icmp slt i32 %{op[1]}, %{op[2]}', 'cmp1', '1', '2'),
    Instruction('br i1 %{op[0]}, label %{op[1]}, label %{op[2]}', 'cmp1', 'while.body', 'while.end')
]

cfg.add_block('while.body')
cfg.last.instructions = [
    Instruction('%{op[0]} = load i32, ptr %{op[1]}', '3', 'b'),
    Instruction('store i32 %{op[0]}, ptr %{op[1]}', '3', 'c'),
    Instruction('%{op[0]} = load i32, ptr %{op[1]}', '4', 'a'),
    Instruction('%{op[0]} = load i32, ptr %{op[1]}', '5', 'b'),
    Instruction('%{op[0]} = add i32 %{op[1]}, %{op[2]}', 'add', '4', '5'),
    Instruction('store i32 %{op[0]}, ptr %{op[1]}', 'add', 'b'),
    Instruction('%{op[0]} = load i32, ptr %{op[1]}', '6', 'c'),
    Instruction('store i32 %{op[0]}, ptr %{op[1]}', '6', 'a'),
    Instruction('%{op[0]} = load i32, ptr %{op[1]}', '7', 'i'),
    Instruction('%{op[0]} = add i32 %{op[1]}, {op[2]}', 'add2', '7', 1),
    Instruction('store i32 %{op[0]}, ptr %{op[1]}', 'add2', 'i'),

    Instruction('br label %{op[0]}', 'while.cond')
]

cfg.add_block('while.end')
cfg.last.instructions = [
    Instruction('%{op[0]} = load i32, ptr %{op[1]}', '8', 'b'),
    Instruction('store i32 %{op[0]}, ptr %{op[1]}', '8', 'retval'),
    Instruction('br label %{op[0]}', 'return')
]

cfg.add_block('return')
cfg.last.instructions = [
    Instruction('%{op[0]} = load i32, ptr %{op[1]}', '9', 'retval'),
    Instruction('ret i32 %{op[0]}', '9')
]
cfg.compute_adjacency()

#from optimizer import *
#mem2reg(cfg)
print(cfg)

The output of this Python script matches the intermediate code generated by clang (well, up to some attributes that I don't need), and it compiles/draws the same way using llvm. Note the commented lines 71-72. Our job for today is to uncomment them ;)

Where to place \(\phi\)-functions?

We have put aside the compiler and written 55+69 lines of Python to manipulate a single Fibonacci example. It uses only local variables and never takes their addresses, so we can safely discard all the alloca instructions.

Next, we need to remove all store/load instructions, replacing them with \(\phi\)-functions in the necessary places. And where are those necessary places?

Hint

In principle, there is a rather brute-force method: at the entry of each block (well, except for the entry block), insert a \(\phi\)-function for each variable. Yes, even in blocks with only one predecessor!

This is called the maximal SSA form, and this approach is usually criticized for creating too many redundant \(\phi\)-functions, slowing down compilation (which I don't care about) and complicating optimization. On the other hand, I strongly suspect that a regular DCE (dead code elimination) pass would clean up the junk completely. When I get to DCE in tinyoptimizer, I should try computing the maximal SSA form; it might turn out to be simpler.

Remember "premature optimization is the root of all evil"...

Let's start with a fairly obvious observation (here and below I am working with the last example): in the if.then block, there is an instruction store i32 0, ptr %retval, which writes the value 0 to the memory at the address %retval. And in the return block, there is an instruction %9 = load i32, ptr %retval, which obviously reads the memory at the same address %retval into the register %9.

But we can reach the return block without passing through the if.then block! For each store instruction, we need to insert a \(\phi\)-function in the following case:

  1. If there is a path from the entry point of the graph that passes through the block with the store and reaches the block with the load,
  2. And if there is another path from the entry point of the graph that reaches the block with the same load without passing through our store.

It is quite obvious that somewhere these two paths converge before reaching the load, and at this merge point, we need a \(\phi\)-function for our variable %retval.

The basic block where our two paths through the graph converge is one of the elements of the dominance frontier of the block containing the store instruction. And here we need a bit of technical vocabulary.

Dominance Frontiers

Definition One

Block A dominates block B if and only if every path from the start of the graph to block B goes through node A.

It is quite obvious that every basic block dominates itself. And, for example, only two blocks dominate the return block: since only entry and return appear on all possible paths to it, they are its only dominators.

Try drawing the graph on paper and determine the set of dominating blocks for each node in the graph. Just in case, so you don't have to scroll back too far, here is the graph again:

Let's synchronize our watches, here is the list of dominators for each block:

{
    entry:      {entry},
    if.then:    {if.then, entry},
    if.end:     {if.end, entry},
    while.cond: {if.end, while.cond, entry},
    while.body: {while.body, if.end, while.cond, entry},
    while.end:  {while.end, if.end, while.cond, entry},
    return:     {return, entry}
}

Definition Two

Block A strictly dominates block B if A dominates B but is not equal to it. Here are the sets of strict dominators for each basic block in our program:

{
    entry:      {},
    if.then:    {entry},
    if.end:     {entry},
    while.cond: {if.end, entry},
    while.body: {if.end, while.cond, entry},
    while.end:  {if.end, while.cond, entry},
    return:     {entry}
}

Definition Three

A is the immediate dominator of B if A strictly dominates B but no other strict dominator of B strictly dominates A.

WAIT, WHAT?

Let's break it down. Each node has a set of strict dominators, right? Well, except for the start node. So, we take the closest one and call it the immediate dominator. Here is the list of immediate dominators for each of our basic blocks:

{
    entry:      None,
    if.then:    entry,
    if.end:     entry,
    while.cond: if.end,
    while.end:  while.cond,
    return:     entry,
    while.body: while.cond
}

This information can also be visualized in the form of a tree called a dominance tree:

Note that the edges of the dominance tree are not necessarily the edges of the original graph.

Definition Four

The dominance frontier of node A is the set of nodes B such that:

  1. A dominates at least one of their predecessors.
  2. A does not strictly dominate the node B itself.

In other words, the dominance frontier of node A consists of nodes where the dominance of A ends: these nodes can be reached through both A and alternative paths that do not pass through A.

If a node in your control flow graph has fewer than two predecessors, it cannot be a merge point of competing definitions, so it will not appear in the dominance frontier of any node. You can also say that the frontier of A consists of the nodes that have incoming edges from A's dominance subtree.

The dominance frontier shows places in the CFG where different control flows converge. Frontier nodes are places where different versions of the same variable may meet, and therefore \(\phi\)-functions need to be inserted there.

Let's synchronize our watches, here is the set of nodes in the dominance frontier for each node:

{
    entry:      {}
    if.then:    {return}
    return:     {}
    while.body: {while.cond}
    if.end:     {return}
    while.end:  {return}
    while.cond: {return, while.cond}
}

Let's return to our store operation inside the if.then block. The return block is in the dominance frontier of if.then, so we need to insert a \(\phi\)-function there. It all seems to match up. By the way, note that while.cond is in its own frontier. This is completely normal, no need to worry. But it should be noted that we have a load of the variable %i in the while.cond block, which we can reach either from the if.end block or from the while.body block. So we will need \(\phi\)-functions there as well.

Okay, we have understood the definition of dominance frontiers and can find them manually, but it would be nice to program it. Here are 43 lines of Python that allow us to find the frontiers for each node in the CFG.

Optimizer
from ir import *

class Optimizer:
    def __init__(self, cfg):
        self.cfg, self.entry = cfg, cfg.blocks['entry']
        self.reachable, self.postorder = set(), []  # for efficiency reasons, we visit blocks in a way that ensures
        self.dfs(self.entry)                        # processing of each block after its predecessors have been processed

    def dfs(self, block):
        self.reachable.add(block)
        for b in block.successors:
            if b not in self.reachable: self.dfs(b)
        self.postorder.append(block)

    def dominators(self):                       # the iterative dominator algorithm [Cooper, Harvey and Kennedy, 2006]
        dom = { b : set(self.cfg.blocks.values()) for b in self.cfg.blocks.values() }
        dom[ self.entry ] = { self.entry }
        changed = True
        while changed:
            changed = False
            for b in self.postorder[-2::-1]:    # for all blocks (except the source) in reverse postorder
                new_dom = set.intersection(*[ dom[p] for p in b.predecessors ]) | { b }
                changed = dom[b] != new_dom
                dom[b]  = new_dom
        return dom                              # dom[b] contains every block that dominates b

    def immediate_dominators(self):
        dom = self.dominators()
        idom = { self.entry : None }                                # idom[b] contains exactly one block, the immediate dominator of b
        for b in self.postorder[-2::-1]:                            # reverse or not we do not care here, but do not visit the source block
            idom[b] = max(dom[b] - {b}, key=lambda x: len(dom[x]))  # immediate dominator is the one with the maximum number of dominators (except the block itself)
        return idom

    def dominance_frontiers(self):
        idom = self.immediate_dominators()
        df = { b : set() for b in self.reachable }
        for b in filter(lambda x: len(x.predecessors)>1, self.reachable): # iterate through junction points among reachable blocks
            for p in b.predecessors:
                runner = p
                while runner != idom[b]:
                    df[runner].add(b)
                    runner = idom[runner]
        return df

I didn't invent anything new here; I took the simplest (and also the slowest) methods, which can easily be found on Wikipedia. Everything is calculated exactly according to the definitions I have provided. First, the sets of dominators are calculated, then the dominance tree is derived from them, and the frontiers are derived from the tree.
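If you want to reproduce the tables above, a throwaway driver along these lines does the job (a sketch of mine; it assumes the modules are named ir.py, fib.py and optimizer.py, and importing fib prints the IR as a side effect):

from fib import cfg                  # builds the Fibonacci CFG and computes adjacency
from optimizer import Optimizer

opt = Optimizer(cfg)
for block, frontier in opt.dominance_frontiers().items():
    print(block.label, '->', sorted(b.label for b in frontier))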

mem2reg

Armed with the terminology, we can finally clearly define the algorithm for placing \(\phi\)-functions in our control flow graph. Behold!

for each variable v
  queue all blocks containing a store to v
  while the queue is not empty
    pop one block b1
    for each block b2 in the dominance frontier of b1
      if b2 does not yet have a \(\phi\)-function for v
        insert it
        add b2 to the queue

It's all very simple, but there is one important nuance to remember: inserting a \(\phi\)-function is somewhat akin to adding a store, so after adding a \(\phi\)-function to block b2, we must also put b2 in the queue for processing.

Note that at this stage we are adding \(\phi\)-functions, but not filling in their arguments. We will deal with this when removing load and store. Let's look at the complete code for the mem2reg pass as I have implemented it:

mem2reg
class mem2reg(Optimizer):
    def __init__(self, cfg):
        super().__init__(cfg)
        self.allocas = self.remove_promotable_allocas()
        self.place_phi()
        stack = [{}] # stack of maps (variable -> value); necessary for storing context while branching
        self.remove_store_load(self.entry, None, set(), stack)

    def remove_promotable_allocas(self):
        allocas = { i.op[0] : i.string.split()[-1] for i in self.entry.instructions if 'alloca' in i.string } # find all allocas in the entry block
        for v in allocas.copy().keys():             # for every variable
            for b in self.reachable:
                for i in b.instructions:            # for every instruction in the CFG
                    if f'* %{v}' in i.string:       # if we find a funcall with a reference to the variable
                        allocas.pop(v, None)        # then the variable is not promotable
        for i in self.entry.instructions[:]:
            if 'alloca' in i.string and i.op[0] in allocas:
                self.entry.instructions.remove(i)   # remove promotable allocas
        return allocas # variable name -> type dictionary

    def place_phi(self):                                        # Fig 9.9b in [Cooper and Torczon, Engineering a Compiler Broché]
        df = self.dominance_frontiers()
        phi_places = { v:set() for v in self.allocas.keys() }   # map (variable -> set of basic blocks)
        for v in self.allocas.keys():
            blocks_with_store = { b for b in self.reachable for i in b.instructions if 'store' in i.string and i.op[1]==v }
            blocks_to_consider = blocks_with_store.copy()
            while blocks_to_consider:
                block = blocks_to_consider.pop()
                for frontier in df[block]:
                    if frontier in phi_places[v]: continue
                    phi_places[v].add(frontier)
                    blocks_to_consider.add(frontier)
        for v, bb in phi_places.items():                        # insert phi nodes (for the moment without choices)
            for b in bb:
                b.phi_functions.append(Phi(v + '_' + b.label, self.allocas[v]))

    def remove_store_load(self, block, prev, visited, stack):
        def find_variable(v, stack):                # walk the stack back until
            for frame in reversed(stack):           # the current variable instance is found
                if v in frame: return frame[v]      # N.B. we suppose that the variables were initialized explicitly

        for phi in block.phi_functions:             # place phi node choice for the current path
            v = phi.reg[:-len(block.label)-1]       # phi saves the choice into a register named foo_bar, where foo is the name of the variable, and bar is the name of the basic block
            val = find_variable(v, stack)
            phi.choice[prev.label] = val
            stack[-1][v] = phi.reg

        if block in visited: return                 # we need to revisit blocks with phi functions as many times as we have incoming edges,
        visited.add(block)                          # therefore the visited check is made after the choice placement

        for i in block.instructions[:]:             # iterate through a copy since we modify the list
            if 'load' in i.string and i.op[1] in self.allocas:
                val = find_variable(i.op[1], stack)
                block.instructions.remove(i)
                block.find_and_replace(i.op[0], val)
            elif 'store' in i.string and i.op[1] in self.allocas:
                stack[-1][i.op[1]] = i.op[0]
                block.instructions.remove(i)
            elif 'br i1' in i.string:
                stack.append({})
                self.remove_store_load(self.cfg.blocks[i.op[1]], block, visited, stack)
                stack.pop()
                stack.append({})
                self.remove_store_load(self.cfg.blocks[i.op[2]], block, visited, stack)
                stack.pop()
            elif 'br label' in i.string:
                self.remove_store_load(self.cfg.blocks[i.op[0]],  block, visited, stack)

To call it, you need to uncomment lines 71-72 in the fib.py file.

The mem2reg class inherits from the Optimizer class, so I have full access to dominance functions. The constructor does all the work. First, it finds which variables in the cfg are subject to this optimization using remove_promotable_allocas(). This function removes alloca of local variables from which no address is taken, i.e., it removes all local variables from the stack that are not accessible for modification by other functions. Then place_phi() inserts \(\phi\)-functions according to the above algorithm with frontiers. The remaining task is to fill their arguments, which is done by remove_store_load().

Let's see how it works, using the same Fibonacci numbers example. After removing alloca (in this case, all of them are removed) and placing \(\phi\)-functions, we get the following intermediate code:

\(\phi\)-node placement
define i32 @fib(i32 %n) {
entry:
        store i32 %n, ptr %n.addr
        store i32 0, ptr %a
        store i32 1, ptr %b
        store i32 1, ptr %i
        store i32 0, ptr %c
        %0 = load i32, ptr %n.addr
        %cmp = icmp eq i32 %0, 0
        br i1 %cmp, label %if.then, label %if.end
if.then:
        store i32 0, ptr %retval
        br label %return
if.end:
        br label %while.cond
while.cond:
        %a_while.cond = phi i32 [?], [?]
        %b_while.cond = phi i32 [?], [?]
        %i_while.cond = phi i32 [?], [?]
        %c_while.cond = phi i32 [?], [?]
        %1 = load i32, ptr %i
        %2 = load i32, ptr %n.addr
        %cmp1 = icmp slt i32 %1, %2
        br i1 %cmp1, label %while.body, label %while.end
while.body:
        %3 = load i32, ptr %b
        store i32 %3, ptr %c
        %4 = load i32, ptr %a
        %5 = load i32, ptr %b
        %add = add i32 %4, %5
        store i32 %add, ptr %b
        %6 = load i32, ptr %c
        store i32 %6, ptr %a
        %7 = load i32, ptr %i
        %add2 = add i32 %7, 1
        store i32 %add2, ptr %i
        br label %while.cond
while.end:
        %8 = load i32, ptr %b
        store i32 %8, ptr %retval
        br label %return
return:
        %retval_return = phi i32 [?], [?]
        %a_return = phi i32 [?], [?]
        %b_return = phi i32 [?], [?]
        %i_return = phi i32 [?], [?]
        %c_return = phi i32 [?], [?]
        %9 = load i32, ptr %retval
        ret i32 %9
}

I'm not providing a picture because llvm refuses to draw it :) It makes sense: the code is broken. In the first basic block, the register %a is used but never defined (its alloca is gone), and the \(\phi\)-functions, though correctly placed, are not valid instructions without arguments.

Let's do a depth-first search, starting, obviously, from the entry point of the program. So, we have the entry block:

entry:
        store i32 %n, ptr %n.addr
        store i32 0, ptr %a
        store i32 1, ptr %b
        store i32 1, ptr %i
        store i32 0, ptr %c
        %0 = load i32, ptr %n.addr
        %cmp = icmp eq i32 %0, 0
        br i1 %cmp, label %if.then, label %if.end

We go through all the instructions sequentially, removing all store and load instructions while examining their arguments. The first instruction stores %n at the address %n.addr, meaning that any load from this address in this block can be replaced with %n, and in particular, %0 is also replaced with %n. Similarly, we remember the values for %a, %b, %i, %c, and remove their store. After processing, entry looks like this:

entry:
        %cmp = icmp eq i32 %n, 0
        br i1 %cmp, label %if.then, label %if.end

The entry block has two successors - if.then and if.end. Since we are doing a depth-first traversal, we first go to if.then. It looks like this:

if.then:
        store i32 0, ptr %retval
        br label %return

We remove the store, remember that %retval is 0, and go to return. It looks like this:

return:
        %retval_return = phi i32 [?], [?]
        %a_return = phi i32 [?], [?]
        %b_return = phi i32 [?], [?]
        %i_return = phi i32 [?], [?]
        %c_return = phi i32 [?], [?]
        %9 = load i32, ptr %retval
        ret i32 %9

It's time to start filling in the \(\phi\)-functions. We came from the if.then branch, so we remember that %retval in this branch is 0, %a is 0, %b is 1, %i is 1, and %c is 0. We fill in the corresponding argument of the \(\phi\)-functions, remove the load, replacing %9 with %retval_return. After processing, the block looks like this:

return:
        %retval_return = phi i32 [0, %if.then], [?]
        %a_return = phi i32 [0, %if.then], [?]
        %b_return = phi i32 [1, %if.then], [?]
        %i_return = phi i32 [1, %if.then], [?]
        %c_return = phi i32 [0, %if.then], [?]
        ret i32 %retval_return

This is a terminal block, so the depth-first traversal returns to if.end (nothing to do there), then while.cond, and so on.

As a result, my code generates the following control flow graph:

Let's compare it with what llvm generated from clang's output:

As you can see, llvm was a bit smarter: it noticed that a, b, i, c are not used in the return block and did not pass them through, and it also understood that c is overwritten in while.body, so it didn't pass it through while.cond. Otherwise, the result is identical, so I am quite satisfied with my toy compiler.

Well, it's time to integrate this with the compiler. All the tests pass, hooray!

ssloy@periwinkle:~/tinyoptimizer$ make test
Testing test-programs/nontrivial/mandelbrot.wend... ok
Testing test-programs/nontrivial/bitwise.wend... ok
Testing test-programs/nontrivial/trig-hp12c.wend... ok
Testing test-programs/nontrivial/sqrt.wend... ok
Testing test-programs/simple/fixed-point.wend... ok
Testing test-programs/simple/eight-queens.wend... ok
Testing test-programs/simple/mutual-recursion.wend... ok
Testing test-programs/simple/popcount.wend... ok
Testing test-programs/simple/collatz.wend... ok
Testing test-programs/simple/sopfr.wend... ok
Testing test-programs/elementary/helloworld.wend... ok
Testing test-programs/elementary/arithmetic.wend... ok
Testing test-programs/elementary/scope.wend... ok
Testing test-programs/elementary/overload.wend... ok
Testing test-programs/elementary/int-overflow.wend... ok

The compiler code lives here, and you can play with mem2reg manually, without the rest of the compiler, using the files from this article.

