3.1. LLVMlite Specification¶
3.1.1. Overview¶
3.1.2. LLVMlite Syntax¶
define i64 @fac(i64 %n) { ; (1) %1 = icmp sle i64 %n, 0 ; (2) br i1 %1, label %ret, label %rec ; (3) ret: ; (4) ret i64 1 rec: ; (5) %2 = sub i64 %n, 1 ; (6) %3 = call i64 @fac(i64 %2) ; (7) %4 = mul i64 %n, %3 ret i64 %4 ; (8) } define i64 @main() { ; (9) %1 = call i64 @fac(i64 6) ret i64 %1 }
First, notice the function definition at (1). The i64 annotations declare the return type and the type of the argument n. The argument is prefixed with "%" to indicate that it's an identifier local to the function, while fac is prefixed with "@" to indicate that it is global (i.e. in scope in the entire compilation unit).
Next, at (2) we have the first instruction of the body of fac, which performs a signed comparison of the argument %n and 0 and assigns the result to the temporary %1. The instruction at (3) is a "terminator", and marks the end of the current block. It will transfer control to either ret at (4) or rec at (5). The labels at (4) and (5) each indicate the beginning of a new block of instructions. Notice that the entry block starting at (2) is not labeled: in LLVM it is illegal to jump back to the entry block of a function body. Moving on, (6) performs a subtraction and names the result %2. The i64 annotation indicates that both operands are 64-bit integers. The function fac is called at (7), and the result named %3. Again, the i64 annotations indicate that the single argument and the returned value are 64-bit integers.
Finally, (8) returns from the function with the result named by %4, and (9) defines the main function of the program, which simply calls fac with a literal i64 argument.
Program Structure
LLVMlite programs consist of three types of global definitions: function definitions, global data definitions, and named type definitions, which may be interleaved. These definitions are in scope for the entire compilation unit, may be mutually recursive, and need not be declared in order.
Types
Functions, global data definitions, and instructions are explicitly annotated with types. These are divided into "simple" types that may appear on the stack and as arguments to functions and "aggregate" types that may only appear in global and heap-allocated data. (Unlike full LLVM, LLVM lite does not allow locals to hold structured data.) There is also a "void" type that only appears in the return type of instructions and functions that do not return a value. This is essentially the OCaml unit type, but it has the additional restriction that it cannot appear as the type of an operand, so it is actually illegal to give it a name in the LLVM concrete syntax. In the following table we use T to range over simple and aggregate (non-void, non-function) types, F to range over function types, and S to range over simple types.
Concrete Syntax |
Kind |
Description |
void | void | Indicates the instruction does not return a usable value. |
i1, i64 | simple | 1-bit (boolean) and 64-bit integer values. |
T* | simple | Pointer that can be dereferenced if its target is compatible with T |
i8* | simple | Pointer to the first character in a null-terminated array of bytes. Note: i8* is a valid type, but just i8 is not. LLVMlite programs do not operate over byte-sized integer values. |
F* | simple | Function pointer |
S(S1, ..., SN) | function | A function from S1, ..., SN to S |
void(S1, ..., SN) | function | A function from S1, ..., SN to void |
{ T1, ..., TN } | aggregate | Tuple of values of types T1, ..., TN |
[ N x T ] | aggregate | Exactly N values of type T |
%NAME | * | Abbreviation defined by a top-level named type definition |
Named Types
Named type definitions
%IDENT = type T
define abbreviations for types in the scope of the entire compilation unit. The following specification assumes that these are replaced with their definitions whenever they are encountered. Note that recursive types, in which T mentions %IDENT are allowed, but for the type to be well formed, each such recursive occurrence must appear under a *. More generally, any collection of named types may be mutually recursive (i.e. the names may appear in the the definitions), but each cycle of such references must be broken by a *.
Global Definitions
The next kind of top-level definition is global data
@IDENT = global T Gwhere G ranges over global initializers, described in the following table, and T is the associated type. The global identifier @IDENT, when used in the program, has type T*. For example, the following program fragment has valid annotations:
@foo = global i64 42 @bar = global i64* @foo @baz = global i64** @bar
Concrete Syntax |
Type |
Description |
null | T* | The null pointer constant. |
[0-9]+ | i64 | 64-bit integer literal. |
@IDENT | T* | Global identifier. The type is always a pointer of the type associated with the global definition. |
c"[A-z]*\00" | [ N x i8 ] | String literal. The size of the array N should be the length of the string in bytes, including the null terminator \00. |
[ T G1, ..., T GN ] | [ N x T ] | Array literal. |
{ T1 G1, ..., TN GN } | {T1,...,TN}   | Struct literal. |
bitcast (T1* G1 to T2*) | T2* | Bitcast. |
Operands
We now turn to the parts of a function declaration. Each instruction in a function has zero or more operands which for the purposes of determining the well-formedness of programs, are restricted to the following types.
Concrete Syntax |
Type |
Description |
null | T* | The null pointer constant |
[0-9]+ | i64 | 64-bit integer literal |
@IDENT | T* | Global identifier. The type can always be determined from the global definitions and is always a pointer |
%IDENT | S | Local identifier: can only name values of simple type. The type determined by an local definition of %IDENT in scope |
Instructions and Terminators
The following table describes the restrictions on the types that may appear as parameters of well-formed instructions, and the constraints on the operands and result of the instruction for the purposes of type-checking. We assume that named types have been replaced by their definitions.
For example, in the call instruction, each type parameter S1, ..., SN must be a simple type. When we type check a program containing this instruction, we must make sure that the operand OP1 has exactly the function pointer type S1(S2, ..., SN)*, and that the remaining operands OP2, ..., OPN have types S2, ..., SN.
Concrete Syntax |
Operand → Result Types |
%L = BOP i64 OP1, OP2 | i64 x i64 → i64 |
%L = alloca S | - → S* |
%L = load S* OP | S* → S |
store S OP1, S* OP2 | S x S* → void |
%L = icmp CND S OP1, OP2 | S x S → i1 |
%L = call S1 OP1(S2 OP2, ..., SN OPN) | S1(S2, ..., SN)* x S2 x ... x SN → S1 |
call void OP1(S2 OP2, ... ,SN OPN) | void(S2, ..., SN)* x S2 x ... x SN → void |
%L = getelementptr T1* OP1, i32 OP2, ..., i32 OPN | T1* x i64 x ... x i64 -> GEPTY(T1, OP1, ..., OPN)* |
%L = bitcast T1* OP to T2* | T1* → T2* |
Getelementptr Well-Formedness and Result Type
The getelementptr instruction has some additional well-formedness requirements. Operands after the first must all be constants, unless they are used to index into an array. LLVM actually requires the operands used to index into structs to be 32-bit integers. Rather than introducing 32-bit integers into our language, we will use our 64-bit constants and operands and assume the arguments of getelementptr always fall in the range [0, Int32.max_int].
In the table above, the result type of a getelementptr instruction described using the GEPTY function, which is defined in pseudocode as follows:
GEPTY : T -> operand list -> T GEPTY T operand::path' = GEPTY' T path' GEPTY' : T -> operand list -> T GEPTY' T [] = T GEPTY' { T1, ..., TN } (Const m)::path' = GEPTY' Tm path' when m <= N GEPTY' [ _ x T ] operand::path' = GEPTY' T path'Notice that GEPTY is a partial function. When GEPTY is not defined, the corresponding instruction is malformed. This happens when, for example:
- The list of index operands provided is empty
- An operand used to index a struct is not a constant
- The type is not an aggregate and the list of indices is not empty
Blocks, CFGs, and Function Definitions
A block (or "basic" block) is just a sequence of instructions followed by a terminator:
Concrete Syntax |
Operand → Result Types |
ret void | - → - |
ret S OP | S → - |
br label %LAB | - → - |
br i1 OP, label %LAB1, label %LAB2   | i1 → - |
The body of a function is represented by a control flow graph (CFG). A CFG consists of a distinguished entry block and a sequence blocks of prefixed with a label LAB:. A function definition has a return type, the function name, a list of formal parameters and their types, and the body of the function. The full syntax of a function definition is then:
define [S|void] @IDENT(S1 OP, ... , SN OP) { BLOCK (LAB: BLOCK)...}
Like global data definitions, the type of the defined global identifier @IDENT is S(S1, ... , SN)* or void(S1, ... , SN)*, a function pointer.
There are some additional global well-formedness requirements for function definitions. Each label and local definition must be unique. In this way, a local identifier both names the result of an instruction and serves to identify the instruction within a function body. For the locals in a CFG to be well scoped, there must never be a path to a use of a local that does not pass through its definition. We will not go into the details here.