# Design of the Emboss Compiler (`embossc`)
This document describes the internals of the Emboss compiler. End users do not
need to read this document.
## Overall Design
The Emboss compiler follows a reasonably standard compiler design, where the
input source text is first converted to an *intermediate representation* (IR),
then various operations are performed on the IR, and finally the IR is used to
construct the final output — at the time of writing, C++ source code:
```mermaid
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
diskstart@{ shape: doc, label: "example.emb" }
parser["Parser"]
processing@{ shape: procs, label: "IR processing" }
backend["C++ Code Generator"]
diskend@{ shape: doc, label: "example.emb.h" }
diskstart-- ".emb" -->parser
parser-- IR -->processing
processing-- IR -->backend
backend-- "C++" -->diskend
```
Currently, the Emboss compiler is split into three programs: the *front end*,
which parses the input and does almost all of the IR processing; the *C++
back end*, which does a minimal amount of C++-specific IR processing and
generates the final C++ code; and the `embossc` driver, which combines the
front end and back end. This split makes it straightforward to add new back
ends later:
```mermaid
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
diskstart@{ shape: doc, label: "example.emb" }
parser["Parser"]
processing@{ shape: procs, label: "IR processing" }
cppbackend["C++ Code Generator"]
cppdiskend@{ shape: doc, label: "example.emb.h" }
rustbackend["Rust Code Generator"]
rustdiskend@{ shape: doc, label: "example.emb.rs" }
mdbackend["Documentation Generator"]
mddiskend@{ shape: doc, label: "example.emb.md" }
diskstart-- ".emb" -->parser
parser-- IR -->processing
processing-- IR -->cppbackend
cppbackend-- "C++" -->cppdiskend
processing-- IR -->rustbackend
rustbackend-- Rust -->rustdiskend
processing-- IR -->mdbackend
mdbackend-- Markdown -->mddiskend
```
### IR (Intermediate Representation)
"IR" is a general term: each compiler has its own IR. The Emboss IR is a tree,
with node types defined in [compiler/util/ir_data.py][ir_data_py].
[ir_data_py]: ../compiler/util/ir_data.py
The first stage of the compiler — the parser — generates an "initial" IR, which
contains only information that is directly available in the source text. Even
so, the initial IR can be quite large for a very short `.emb` file. For
example, this line:
```emb
0 [+4] UInt file_state
```
turns into this IR (in JSON serialization) immediately after parsing:
```json
{
  "location": {
    "start": {
      "constant": {
        "value": "0",
        "source_location": "22:3-22:4"
      },
      "source_location": "22:3-22:4"
    },
    "size": {
      "constant": {
        "value": "4",
        "source_location": "22:8-22:9"
      },
      "source_location": "22:8-22:9"
    },
    "source_location": "22:3-22:10"
  },
  "type": {
    "atomic_type": {
      "reference": {
        "source_name": [
          {
            "text": "UInt",
            "source_location": "22:13-22:17"
          }
        ],
        "source_location": "22:13-22:17"
      },
      "source_location": "22:13-22:17"
    },
    "source_location": "22:13-22:17"
  },
  "name": {
    "name": {
      "text": "file_state",
      "source_location": "22:25-22:35"
    },
    "source_location": "22:25-22:35"
  },
  "existence_condition": {
    "boolean_constant": {
      "value": true,
      "source_location": "22:3-22:35"
    },
    "source_location": "22:3-22:35"
  },
  "source_location": "22:3-22:35"
}
```
In graphical form (with `source_location` nodes omitted for clarity):
```mermaid
graph TD
n0@{ shape: diamond, label: "field" }
n0 --> n1
n0 --> n4
n0 --> n9
n0 --> n10
n1@{ shape: diamond, label: "location" }
n1 --> n2
n1 --> n3
n2@{ shape: diamond, label: "start" }
n2 --> l0
n3@{ shape: diamond, label: "size" }
n3 --> l1
n4@{ shape: diamond, label: "type" }
n4 --> n5
n5@{ shape: diamond, label: "atomic_type" }
n5 --> n6
n6@{ shape: diamond, label: "reference" }
n6 --> n7
n7@{ shape: diamond, label: "source_names" }
n7 --> n8
n8@{ shape: diamond, label: "source_name" }
n8 --> l2
n9@{ shape: diamond, label: "name" }
n9 --> l3
n10@{ shape: diamond, label: "existence_condition" }
n10 --> l4
l0@{ shape: rect, label: "constant: 0" }
l1@{ shape: rect, label: "constant: 4" }
l2@{ shape: rect, label: "text: UInt" }
l3@{ shape: rect, label: "name: file_state" }
l4@{ shape: rect, label: "boolean_constant: True" }
```
This initial IR then goes through a series of *elaborations*, which annotate
the IR, and *validations*, which check various properties of the IR. These are implemented as *stages* in the compiler:
```mermaid
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
diskstart@{ shape: doc, label: "example.emb" }
parser["Parser"]
stage1@{ shape: proc, label: "Stage 1" }
stage2@{ shape: proc, label: "Stage 2" }
stages@{ shape: text, label: "..." }
stagenm1@{ shape: proc, label: "Stage N-1" }
stagen@{ shape: proc, label: "Stage N" }
backend["C++ Code Generator"]
diskend@{ shape: doc, label: "example.emb.h" }
diskstart-->parser
parser-->stage1
stage1-->stage2
stage2-->stages
stages-->stagenm1
stagenm1-->stagen
stagen-->backend
backend-->diskend
```
In many cases, elaborations and validations are mixed together — for example,
in the symbol resolution stage, names in the IR (`field`) are *elaborated* with
the absolute symbol to which they resolve (`module.Type.field`), and, at the
same time, the symbol resolver *validates* that every name resolves to exactly
one absolute symbol.
At the end of this process, the IR is much larger:
```mermaid
graph TD
n0@{ shape: diamond, label: "field" }
n0 --> n1
n0 --> n8
n0 --> n15
n0 --> n16
n0 --> n19
n0 --> n22
n1@{ shape: diamond, label: "location" }
n1 --> n2
n1 --> n5
n2@{ shape: diamond, label: "start" }
n2 --> l0
n2 --> n3
n3@{ shape: diamond, label: "type" }
n3 --> n4
n4@{ shape: diamond, label: "integer" }
n4 --> l1
n4 --> l2
n4 --> l3
n4 --> l4
n5@{ shape: diamond, label: "size" }
n5 --> l5
n5 --> n6
n6@{ shape: diamond, label: "type" }
n6 --> n7
n7@{ shape: diamond, label: "integer" }
n7 --> l6
n7 --> l7
n7 --> l8
n7 --> l9
n8@{ shape: diamond, label: "type" }
n8 --> n9
n9@{ shape: diamond, label: "atomic_type" }
n9 --> n10
n10@{ shape: diamond, label: "reference" }
n10 --> n11
n10 --> n13
n11@{ shape: diamond, label: "canonical_name" }
n11 --> l10
n11 --> n12
n12@{ shape: diamond, label: "object_paths" }
n12 --> l11
n13@{ shape: diamond, label: "source_names" }
n13 --> n14
n14@{ shape: diamond, label: "source_name" }
n14 --> l12
n15@{ shape: diamond, label: "write_method" }
n15 --> l13
n16@{ shape: diamond, label: "name" }
n16 --> l14
n16 --> n17
n17@{ shape: diamond, label: "canonical_name" }
n17 --> l15
n17 --> n18
n18@{ shape: diamond, label: "object_paths" }
n18 --> l16
n18 --> l17
n19@{ shape: diamond, label: "attributes" }
n19 --> n20
n20@{ shape: diamond, label: "attribute" }
n20 --> l18
n20 --> n21
n20 --> l20
n21@{ shape: diamond, label: "value" }
n21 --> l19
n22@{ shape: diamond, label: "existence_condition" }
n22 --> l21
n22 --> n23
n23@{ shape: diamond, label: "type" }
n23 --> l22
l0@{ shape: rect, label: "constant: 0" }
l1@{ shape: rect, label: "modulus: infinity" }
l2@{ shape: rect, label: "modular_value: 0" }
l3@{ shape: rect, label: "minimum_value: 0" }
l4@{ shape: rect, label: "maximum_value: 0" }
l5@{ shape: rect, label: "constant: 4" }
l6@{ shape: rect, label: "modulus: infinity" }
l7@{ shape: rect, label: "modular_value: 4" }
l8@{ shape: rect, label: "minimum_value: 4" }
l9@{ shape: rect, label: "maximum_value: 4" }
l10@{ shape: rect, label: "module_file: " }
l11@{ shape: rect, label: "object_path: UInt" }
l12@{ shape: rect, label: "text: UInt" }
l13@{ shape: rect, label: "physical: True" }
l14@{ shape: rect, label: "name: file_state" }
l15@{ shape: rect, label: "module_file: testdata/golden/span" }
l16@{ shape: rect, label: "object_path: LogFileStatus" }
l17@{ shape: rect, label: "object_path: file_state" }
l18@{ shape: rect, label: "name: byte_order" }
l19@{ shape: rect, label: "string_constant: LittleEndian" }
l20@{ shape: rect, label: "is_default: False" }
l21@{ shape: rect, label: "boolean_constant: True" }
l22@{ shape: rect, label: "boolean: True" }
```
### Front End vs Back End(s)
The Emboss compiler is divided into a front end, which does most of the work,
and back ends, which do language-specific validations and translate the final
IR to the final output format. Currently, only a C++ back end exists.
The compiler is structured so that the front end and back end can run as
separate programs, and when building with [Bazel][bazel] they do run
separately. For efficiency, the [`embossc`][embossc_source] driver just
imports the front end and C++ back end directly, so that it can skip the JSON
serialization and deserialization steps.
[bazel]: https://bazel.build/
[embossc_source]: ../embossc
The front end consists of (as of the time of writing) 14 stages:
1. Tokenization
2. Parse Tree Generation
3. Parse Tree → IR
4. Desugaring + Built-In Field Synthesis
5. Symbol Resolution Part 1: Head Symbols
6. Dependency Cycle Checking
7. Dependency Order Computation
8. Symbol Resolution Part 2: Field Access
9. Type Annotation
10. Type Checking
11. Bounds Computation
12. Front End Attribute Normalization + Verification
13. Miscellaneous Constraint Checking
14. Inferred Write Method Generation
Each of these stages will be explained in more detail later in this document.
The C++ back end is much simpler, with only 2 stages:
1. Back-End Attribute Normalization + Verification
2. C++ Header Generation
This lopsidedness is intentional: work done in the front end can be shared by
all back ends, but work in a back end only benefits that specific back end.
### IR Traversal
Most stages walk the IR one (or more) times, processing or checking specific
types of nodes. The Emboss compiler uses a special
[`fast_traverse_ir_top_down()`][traverse_ir] function that takes a *pattern*
and calls a handler on all nodes that match that pattern.
`fast_traverse_ir_top_down()` has been optimized to walk trees efficiently: it
will skip branches if no node in the branch could possibly match the pattern.
[traverse_ir]: ../compiler/util/traverse_ir.py#:~:def%20fast_traverse_ir_top_down
For example, this will call `print()` on every `Word` in the IR:
```python
fast_traverse_ir_top_down(ir, [ir_data.Word], print)
```
Or you could limit the scope to only `Word` nodes inside of `Import` nodes:
```python
fast_traverse_ir_top_down(ir, [ir_data.Import, ir_data.Word], print)
```
`fast_traverse_ir_top_down()` has a lot of options. For full usage, see
[its documentation in the source code][traverse_ir].
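To make the branch-skipping idea concrete, here is a minimal sketch of a
pattern-directed traversal. The node shapes and the `_CAN_CONTAIN` table are
made up for this example, and the real `traverse_ir.py` implementation differs
in detail:

```python
# Illustrative sketch only: a pattern-directed, top-down traversal with
# branch skipping.  The node dictionaries and the _CAN_CONTAIN table are
# invented for this example; the real code works on ir_data objects.
_CAN_CONTAIN = {
    # For each node type, the node types that can appear anywhere beneath it.
    "Module": {"Import", "Struct", "Field", "Word"},
    "Import": {"Word"},
    "Struct": {"Field", "Word"},
    "Field": {"Word"},
    "Word": set(),
}

def traverse(node, pattern, action):
    """Calls `action` on nodes reached by matching `pattern`, top down."""
    remaining = pattern[1:] if node["type"] == pattern[0] else pattern
    if not remaining:
        action(node)
        return
    # Branch skipping: if nothing below this node could match the next
    # pattern element, do not descend at all.
    if remaining[0] not in _CAN_CONTAIN[node["type"]]:
        return
    for child in node.get("children", []):
        traverse(child, remaining, action)

module = {"type": "Module", "children": [
    {"type": "Import", "children": [{"type": "Word", "text": "other.emb"}]},
    {"type": "Struct", "children": [{"type": "Word", "text": "Foo"}]},
]}
# Prints only the Word inside the Import; the Struct subtree is skipped.
traverse(module, ["Import", "Word"], print)
```

The key point is the early `return`: because (in this toy table) a `Struct`
cannot contain an `Import`, the entire `Struct` subtree is never visited when
matching `[Import, Word]`.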
### Error Handling
The general scheme for running stages is:
```python
def run_stages(stages, prev_ir):
    for stage in stages:
        next_ir, errors = stage(prev_ir)
        if errors:
            return None, errors
        prev_ir = next_ir
    return prev_ir, None
```
That is: run each stage, and if the stage returns any errors, stop processing
and return the error(s).
Errors are represented by lists of [messages][error_py] (`error`, `warning` or
`note`), where, generally, the first message will be an `error` or `warning`,
and any subsequent messages are `note` messages that provide additional
information.
[error_py]: ../compiler/util/error.py
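As a rough illustration only (the real message objects and helper functions
live in `error.py` and are not reproduced here), one returned error might have
a shape along these lines:

```python
# Illustrative only: the rough shape of one "error", a list of messages in
# which the first entry is the error or warning and the rest are notes.
# The real message objects are defined in compiler/util/error.py.
one_error = [
    {"severity": "error",
     "source_file": "example.emb",
     "location": "2:5",
     "message": "Start of field must be an integer."},
    {"severity": "note",
     "source_file": "example.emb",
     "location": "1:1",
     "message": "Field is part of this structure."},
]
# On failure, a stage returns something like (None, [one_error, ...]).
```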
#### Errors on Synthetic Nodes
One important caveat here is that errors on *synthetic* IR nodes — nodes that
did not come directly from the input `.emb` — are deferred: the compiler will
continue processing to try to find an error in a "natural" node to display to
the user. This is covered in more detail in [Built-In Field
Synthesis](#desugaring).
## Front End
[*`compiler/front_end/...`*](../compiler/front_end/)
The front end is responsible for reading in Emboss definitions and producing a
normalized intermediate representation (IR). It is divided into several steps:
roughly, parsing, import resolution, symbol resolution, and validation.
The front end is orchestrated by [`glue.py`][glue_py], which runs each front
end component in the proper order to construct an IR suitable for consumption
by the back end.
[glue_py]: ../compiler/front_end/glue.py
The front end driver program is [`emboss_front_end.py`][emboss_front_end_py],
which calls `glue.ParseEmbossFile` and prints the results.
[emboss_front_end_py]: ../compiler/front_end/emboss_front_end.py
### Tokenization and Parsing
The first part of the front end translates the text of the `.emb` input into an
IR structure.
#### Tokenization
[*`tokenizer.py`*](../compiler/front_end/tokenizer.py)
The very first stage is tokenization, also called lexical analysis or lexing.
Tokenization breaks the input text into *tokens* of one or more characters.
For example, take the string `abc+def`:
```
+---+---+---+---+---+---+---+
| a | b | c | + | d | e | f |
+---+---+---+---+---+---+---+
```
This will be grouped into three tokens, `abc`, `+`, and `def`:
```
+-----------+-----+-----------+
| abc       | +   | def       |
| SnakeWord | "+" | SnakeWord |
+-----------+-----+-----------+
```
In addition, each token is labeled with a category, such as `Documentation`,
`Number`, or `"*"`: by convention, tokens that are always a specific sequence
of characters are given a label that is just the token in double quotes, such
as `"+"` for `+`, `"=="` for `==`, `"if"` for `if`, and so on.
These category labels match names in the grammar in the next step.
There is one extra bit of logic in the Emboss tokenizer, which mimics CPython:
indentation is tracked and translated into special `Indent` and `Dedent`
tokens when it changes. This fragment:
```emb
struct Foo:
  -- word
```
will be tokenized as:
```
+----------+-----------+-----+------+--------+---------------+--------+
| struct   | Foo       | :   | \n   |        | -- word       |        |
| "struct" | CamelWord | ":" | "\n" | Indent | Documentation | Dedent |
+----------+-----------+-----+------+--------+---------------+--------+
```
#### Parse Tree Generation
[*`lr1.py`*](../compiler/front_end/lr1.py)
The list of tokens is then passed to a parser, which produces a parse tree.
The point of this stage is to take the list of tokens, and apply a tree
structure on top of it. Starting with the tokens from the previous example:
```mermaid
graph LR
colon@{label: "\":\""}
newline@{label: "\"\\n\""}
struct ~~~ Foo ~~~ colon ~~~ newline ~~~ Indent ~~~ Documentation ~~~ Dedent
```
the parser infers the parse tree:
```mermaid
graph TD
r0@{shape: diamond, label: "module"}
r1@{shape: diamond, label: "comment-line*"}
r2@{shape: diamond, label: "doc-line*"}
r3@{shape: diamond, label: "import-line*"}
r4@{shape: diamond, label: "attribute-line*"}
r5@{shape: diamond, label: "type-definition*"}
r6@{shape: diamond, label: "type-definition"}
r7@{shape: diamond, label: "struct"}
r8@{shape: diamond, label: "type-name"}
r9@{shape: diamond, label: "type-word"}
r10@{shape: diamond, label: "delimited-parameter-definition-list?"}
r11@{shape: diamond, label: "Comment?"}
r12@{shape: diamond, label: "eol"}
r13@{shape: diamond, label: "comment-line*"}
r14@{shape: diamond, label: "struct-body"}
r15@{shape: diamond, label: "doc-line*"}
r16@{shape: diamond, label: "doc-line"}
r17@{shape: diamond, label: "doc"}
r18@{shape: diamond, label: "Comment?"}
r19@{shape: diamond, label: "eol"}
r20@{shape: diamond, label: "comment-line*"}
r21@{shape: diamond, label: "doc-line*"}
r22@{shape: diamond, label: "attribute-line*"}
r23@{shape: diamond, label: "type-definition*"}
r24@{shape: diamond, label: "struct-field-block"}
r25@{shape: diamond, label: "type-definition*"}
subgraph tokens
direction LR
t0@{shape: rect, label: "struct"}
t1@{shape: rect, label: "Foo"}
t2@{shape: rect, label: ":"}
t3@{shape: rect, label: "\\n"}
t4@{shape: rect, label: "Indent"}
t5@{shape: rect, label: "-- word"}
t6@{shape: rect, label: "\\n"}
t7@{shape: rect, label: "Dedent"}
end
r0 --> r1
r0 --> r2
r0 --> r3
r0 --> r4
r0 --> r5
r5 --> r6
r5 --> r25
r6 --> r7
r7 ----> t0
r7 --> r8
r7 --> r10
r7 ----> t2
r7 --> r11
r7 --> r12
r7 --> r14
r8 --> r9
r9 ----> t1
r12 ----> t3
r12 --> r13
r14 ----> t4
r14 --> r15
r14 --> r22
r14 --> r23
r14 --> r24
r14 ----> t7
r15 --> r16
r15 --> r21
r16 --> r17
r16 --> r18
r16 --> r19
r17 ----> t5
r19 ----> t6
r19 --> r20
```
Parsing is a large subject with many good tutorials and references available,
so only the `embossc` specifics are covered here.
The Emboss compiler uses a [shift-reduce parser][shift_reduce] (specifically, a
canonical LR(1) parser), built on a [custom engine][lr1_py] that has some
features that are not available in most other shift-reduce engines — and also
some limitations.
[shift_reduce]: https://en.wikipedia.org/wiki/Shift-reduce_parser
[lr1_py]: ../compiler/front_end/lr1.py
The biggest limitation is that the table generator only implements the
[canonical LR(1)][canonical_lr1] table generation algorithm, which means that
there is no "side channel" way of specifying precedence or otherwise resolving
conflicts — shift/reduce and reduce/reduce conflicts must be resolved by
removing them from the grammar. (However, Emboss's grammar uses
partially-ordered precedence, which — as far as the author is aware — is not
supported by any off-the-shelf precedence system.)
[canonical_lr1]: https://doi.org/10.1016/S0019-9958(65)90426-2
The most notable unusual feature is the incorporation of a [*Merr*][merr]-like
system for generating error messages.
[merr]: https://doi.org/10.1145/937563.937566
The output from this stage is a *parse tree*, not an IR.
#### Parse Tree → IR
[*`module_ir.py`*](../compiler/front_end/module_ir.py)
Finally, the parse tree is transformed into an IR. This is mostly a
straightforward translation, but there are a few places where the parse tree is
noticeably different from the IR. For example, in the parse tree, lists are
explicit *cons* lists, like:
```mermaid
graph TD
r1@{label: "struct-field-block"}
r2@{label: "struct-field*"}
r3@{label: "struct-field"}
r4@{label: "struct-field*"}
r5@{label: "struct-field"}
r6@{label: "struct-field*"}
r7@{label: "struct-field"}
r8@{label: "struct-field*"}
r1 --> r2
r2 --> r3
r2 --> r4
r4 --> r5
r4 --> r6
r6 --> r7
r6 --> r8
```
while in the IR these lists are flattened:
```mermaid
graph TD
n0@{label: "structure"}
n1@{label: "fields"}
n2@{label: "field"}
n3@{label: "field"}
n4@{label: "field"}
n0 --> n1
n1 --> n2
n1 --> n3
n1 --> n4
```
One bit of pure syntax sugar that is handled at this stage is chained
comparisons: `a == b == c` will be translated to the IR equivalent of `a == b
&& b == c`.
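As an illustration of that rewrite (using throwaway tuples rather than real IR
nodes):

```python
# Illustrative only: desugar a chained comparison like a == b == c into
# (a == b) && (b == c), using ("op", left, right) tuples in place of IR nodes.
def desugar_chain(operands, operators):
    comparisons = [(op, left, right) for op, left, right
                   in zip(operators, operands, operands[1:])]
    result = comparisons[0]
    for comparison in comparisons[1:]:
        result = ("&&", result, comparison)
    return result

print(desugar_chain(["a", "b", "c"], ["==", "=="]))
# ('&&', ('==', 'a', 'b'), ('==', 'b', 'c'))
```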
### IR Processing
<a name="desugaring"></a>
#### Desugaring + Built-In Field Synthesis
[*`synthetics.py`*](../compiler/front_end/synthetics.py)
The first stage that operates on the IR adds built-in fields like
`$size_in_bytes` and rewrites some constructs to a more regular form.
Specifically, anonymous `bits` will be rewritten into non-anonymous `bits`, and
`$next` will be replaced with the offset + size of the previous field.
Starting with this:
```emb
struct Foo:
  0 [+4] bits:
    0 [+1] Flag low
    31 [+1] Flag high
  if low:
    $next [+4] UInt field
```
`synthetics.py` will rewrite the IR into (the equivalent of):
```emb
struct Foo:
  bits EmbossReservedAnonymous0:
    [text_output: "Skip"]
    0 [+1] Flag low
    31 [+1] Flag high
  let $size_in_bytes = $max(0, true ? 0+4 : 0, low ? (0+4)+4 : 0)
  let $max_size_in_bytes = $upper_bound($size_in_bytes)
  let $min_size_in_bytes = $lower_bound($size_in_bytes)
  0 [+4] EmbossReservedAnonymous0 emboss_reserved_anonymous_1
  if low:
    0+4 [+4] UInt field
  let low = emboss_reserved_anonymous_1.low
  let high = emboss_reserved_anonymous_1.high
```
One important detail is that the newly-created nodes are marked as *synthetic*.
This makes a difference for error handling: because of the way that some of the
synthetic nodes are constructed from copied pieces of existing IR, errors can
sometimes be detected in synthetic nodes before they are detected in the
original source nodes they were copied from. These errors tend to be very
confusing to end users, since they reference parts of the IR that were not
entered by an end user. For example, the incorrect:
```emb
struct Foo:
  false [+4] UInt field
```
will have a synthesized `$size_in_bytes` field:
```emb
struct Foo:
  let $size_in_bytes = $max(0, true ? false+4 : 0)
  false [+4] UInt field
```
The erroneous expression `false+4` will be detected before the top-level type
error of using `false` as the field offset, with a message like:
```
example.emb:2:5: error: Left argument of operator '+' must be an integer.
  false [+4] UInt field
  ^^^^^
```
To avoid confusing end users, any error with a synthetic location will be
*deferred*, and only shown to an end user if no non-synthetic errors are
encountered during IR processing. In this case, the correct error message will
be found and displayed to the end user:
```
example.emb:2:5: error: Start of field must be an integer.
  false [+4] UInt field
  ^^^^^
```
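Conceptually, the deferral is just a filter over the collected errors. A rough
sketch, assuming each error is tagged with an `is_synthetic` flag (a
simplification of how locations are actually marked in the IR):

```python
# Illustrative only: show errors at synthetic locations only when no error at
# a "natural" location was found.
def filter_synthetic_errors(errors):
    natural = [e for e in errors if not e["is_synthetic"]]
    return natural if natural else errors

errors = [
    {"is_synthetic": True,
     "message": "Left argument of operator '+' must be an integer."},
    {"is_synthetic": False,
     "message": "Start of field must be an integer."},
]
print(filter_synthetic_errors(errors)[0]["message"])
# Start of field must be an integer.
```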
#### Symbol Resolution Part 1: Head Symbols
[*`symbol_resolver.py`*](../compiler/front_end/symbol_resolver.py)
After desugaring, the first part of symbol resolution occurs: resolving *head*
symbols. Head symbols are any symbols that do not follow a `.`: for example,
`head` or `Head` in `head`, `Head.field` and `head.Type.field`. Resolution of
symbols after `.` happens in a later stage, after some other checking.
Head symbol resolution follows lexical scoping rules: a symbol may refer to
entities in a set of *scopes* determined by its position in the source text.
Head symbol resolution is straightforward: while walking down the IR tree,
build up a list of available scopes. When a symbol is encountered, check to
see if that symbol is *defined* in any of the available scopes. If it is
defined in *exactly one* available scope, bind the symbol to that definition.
If it is not defined, raise an error. Emboss takes the unusual step of *also*
raising an error if the symbol is defined in two or more scopes: in most
computer languages, there is some scope precedence, and the highest-precedence
definition is used. However, this can lead to surprising, and in some cases
*silently incorrect* results when the desired definition is *shadowed* by an
unknown-to-the-user higher-precedence definition, so Emboss raises an error and
requires the user to explicitly pick the correct definition.
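A minimal sketch of that resolution rule, with plain dictionaries standing in
for the real scope objects in `symbol_resolver.py`:

```python
# Illustrative only: resolve a head symbol against every visible scope and
# insist on exactly one definition.  Real scopes, errors, and IR updates
# live in symbol_resolver.py.
def resolve_head_symbol(name, scopes):
    """scopes: outermost-first list of {name: definition} dictionaries."""
    matches = [scope[name] for scope in scopes if name in scope]
    if not matches:
        raise ValueError(f"No candidate for '{name}'.")
    if len(matches) > 1:
        raise ValueError(f"Ambiguous name '{name}': multiple definitions "
                         "are visible; qualify the name explicitly.")
    return matches[0]

module_scope = {"Foo": "my_module.Foo"}
struct_scope = {"field_a": "my_module.Foo.field_a"}
print(resolve_head_symbol("field_a", [module_scope, struct_scope]))
# my_module.Foo.field_a
```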
#### Dependency Cycle Checking
[*`dependency_checker.py`*](../compiler/front_end/dependency_checker.py)
Once head symbols have been resolved, there is enough information in the IR to
check for cycles in the dependency graph: cases where `object_1` depends on
`object_2` and `object_2` also depends on `object_1`, sometimes through a chain
of multiple dependencies.
The Emboss compiler accomplishes this by first building up the complete
dependency graph of the IR, and then running a graph partitioning algorithm
([Tarjan's algorithm][tarjan], in this case) on the dependency graph to find
all strongly-connected components. If there are any partitions with more than
one node, or any nodes with a self edge, then there is a dependency cycle.
[tarjan]: https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm
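A compact sketch of that check, using a straightforward recursive
implementation of Tarjan's algorithm over an adjacency-list graph (the real
`dependency_checker.py` builds the graph from the IR and produces proper error
messages):

```python
# Illustrative only: find strongly-connected components with a recursive
# Tarjan's algorithm, then report any component with more than one node or
# with a self edge as a dependency cycle.
def strongly_connected_components(graph):
    index, lowlink, on_stack, stack, components = {}, {}, set(), [], []

    def visit(node):
        index[node] = lowlink[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for successor in graph[node]:
            if successor not in index:
                visit(successor)
                lowlink[node] = min(lowlink[node], lowlink[successor])
            elif successor in on_stack:
                lowlink[node] = min(lowlink[node], index[successor])
        if lowlink[node] == index[node]:
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components

def find_cycles(graph):
    return [c for c in strongly_connected_components(graph)
            if len(c) > 1 or any(node in graph[node] for node in c)]

# a -> b -> c -> a is a cycle; d depends on a but is not part of any cycle.
dependencies = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(find_cycles(dependencies))  # one cycle: {'a', 'b', 'c'} (in some order)
```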
#### Dependency Order Computation
[*`dependency_checker.py`*](../compiler/front_end/dependency_checker.py)
Every structure in the IR is also annotated with a 'dependency ordering': an
ordering of its fields such that any field *A* that depends on another field
*B* is later in the ordering than *B*.
This ordering is not used until the back end (where it is used to order fields
for text output), but it is convenient to build it here, where the dependency
graph has already been built.
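The ordering itself is a standard topological sort; a sketch (again over a
plain adjacency list rather than the real IR):

```python
# Illustrative only: order fields so that every field comes after all of the
# fields it depends on.  Assumes the cycle check above has already passed.
def dependency_order(dependencies):
    order, placed = [], set()

    def place(field):
        if field in placed:
            return
        placed.add(field)
        for dependency in dependencies.get(field, []):
            place(dependency)
        order.append(field)

    for field in dependencies:
        place(field)
    return order

# c depends on b, which depends on a.
print(dependency_order({"c": ["b"], "b": ["a"], "a": []}))  # ['a', 'b', 'c']
```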
#### Symbol Resolution Part 2: Field Access
[*`symbol_resolver.py`*](../compiler/front_end/symbol_resolver.py)
Next, "tail" symbols are resolved: any symbol immediately after `.`. These are
resolved after checking for dependency cycles because the resolution algorithm
can become stuck in infinite recursion if there is a dependency cycle.
#### Type Annotation
[*`type_check.py`*](../compiler/front_end/type_check.py)
With symbols resolved, it is possible to annotate expression types. The Emboss
compiler annotates *every* expression and subexpression with a type. At this
stage, those types are coarse: integer, enum, boolean, string, "opaque", and so
on.
This stage also checks expressions for *internal* type consistency. For
example, it will find the error in:
```emb
struct Foo:
  let a = 10 + true
```
but not in:
```emb
struct Foo:
  true [+4] UInt a
```
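A sketch of what coarse, bottom-up annotation with internal consistency
checking looks like, using tuples and Python values in place of IR expression
nodes (the real rules live in `type_check.py`):

```python
# Illustrative only: assign a coarse type to every subexpression, bottom up,
# and flag operator/operand mismatches.  Tuples stand in for IR expressions.
def annotate(expression):
    if isinstance(expression, bool):   # bool before int: True is an int too
        return "boolean"
    if isinstance(expression, int):
        return "integer"
    operator, left, right = expression
    left_type, right_type = annotate(left), annotate(right)
    if operator == "+":
        if (left_type, right_type) != ("integer", "integer"):
            raise TypeError("Arguments of '+' must be integers.")
        return "integer"
    if operator == "==":
        if left_type != right_type:
            raise TypeError("Arguments of '==' must have the same type.")
        return "boolean"
    raise NotImplementedError(operator)

print(annotate(("+", 10, 4)))     # integer
try:
    annotate(("+", 10, True))     # the error in `let a = 10 + true`
except TypeError as error:
    print(error)                  # Arguments of '+' must be integers.
```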
#### Type Checking
[*`type_check.py`*](../compiler/front_end/type_check.py)
After annotating all types, the compiler checks that expressions have the
correct top-level types: that is, that their types are correct for their
positions within the IR. This is the stage that detects the use of a boolean
as a field offset from the previous example:
```emb
struct Foo:
  true [+4] UInt a
```
#### Bounds Computation
[*`expression_bounds.py`*](../compiler/front_end/expression_bounds.py)
With coarse type checking done, the expression types can be *refined* into
narrower types: for example, instead of just "integer," "integer between 0 and
65535, inclusive."
Fully understanding this stage involves (just a tiny bit of!) Type Theory:
mainly, that a "type" in the computer science sense is just a *set* (in the
mathematical sense) of all the values that a variable of that type could
possibly hold. This sort of "type" is conflated with physical layout in most
industrial computing languages (including Emboss), but is conceptually
separate.
The boolean type is the set $\left\\{ \scriptstyle{FALSE}, \scriptstyle{TRUE}
\right\\}$, so its only refinements (nonempty proper subsets) are $\left\\{
\scriptstyle{FALSE} \right\\}$ and $\left\\{ \scriptstyle{TRUE} \right\\}$.
This stage will try to make those refinements, where possible.
The base integer type in the Emboss expression language is the set of all
integers, normally written as $\mathbb{Z}$. It is obviously impossible for any
real-world computer to have a variable of type $\mathbb{Z}$ — there is no
computer with infinite memory — so this type needs to be refined.
In Emboss, currently, integer types are refined into the form $\left\\{ x | x
\in \mathbb{Z} \wedge a \le x \le b \wedge x \equiv n \pmod m \right\\}$: that
is, $x$ is an integer ($x \in \mathbb{Z}$), $x$ is at least $a$ and at most $b$
($a \le x \le b$), and $x$ % $m$ == $n$ ($x \equiv n \pmod m$). The values $a$
and $b$ are lower and upper bounds, respectively, while $m$ and $n$ are used
for alignment tracking — $m$ tracks the "level" of alignment, and $n$ tracks
the "offset" within the alignment.
This integer type refinement starts with known types. A value read from an
`Int` or `UInt` field will have known bounds (based on the bit width of the
field), $m = 1$ and $n = 0$. A constant $c$ has $a = b = n = c$ and $m =
\infty$.
From this basis, the refinements for any operation can be determined. For
example, for `x + y`, the resulting lower and upper bounds are $a = a_x + a_y$
and $b = b_x + b_y$. The resulting alignment is a little more complex, involving
the greatest common divisor function.
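A sketch of the addition rule, with `None` standing in for the infinite
modulus of a constant (the real bookkeeping lives in `expression_bounds.py`):

```python
import math

# Illustrative only: propagate bounds and alignment through `x + y`.  Each
# refinement is (minimum, maximum, modulus, modular_value); modulus None
# stands in for the "infinity" modulus of a compile-time constant.
def add_bounds(x, y):
    x_min, x_max, x_mod, x_val = x
    y_min, y_max, y_mod, y_val = y
    if x_mod is None and y_mod is None:
        modulus, value = None, x_val + y_val        # constant + constant
    elif x_mod is None or y_mod is None:
        modulus = x_mod if y_mod is None else y_mod
        value = (x_val + y_val) % modulus
    else:
        modulus = math.gcd(x_mod, y_mod)
        value = (x_val + y_val) % modulus
    return (x_min + y_min, x_max + y_max, modulus, value)

# A 16-bit UInt field: bounds 0..65535, modulus 1, modular value 0.
field = (0, 65535, 1, 0)
# The constant 4: bounds 4..4, "infinite" modulus, modular value 4.
constant = (4, 4, None, 4)
print(add_bounds(field, constant))  # (4, 65539, 1, 0)
```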
The Emboss functions `$lower_bound(x)` and `$upper_bound(x)` are a little
strange: their result is a constant that is the lower or upper bound
(respectively) of their argument! That is, they extract a value from the type
system and return it as an expression. There aren't a lot of human-authored
uses of these functions, but they are used in the definitions of fields like
`$min_size_in_bytes` and `$max_size_in_bits`.
The refinements that Emboss finds are not always perfectly optimal, for various
reasons: for example, even if $x < 0$, `x * x` cannot be negative, but Emboss
does not try to detect any *cross-argument constraints*, so it will not figure
this out.
It might be tempting to add better refinement to Emboss, but there are
downsides. First, new refinements take up compiler time, which can contribute
to slow builds. Many possible refinements would only be found on a tiny
fraction of all expressions, so it is not worth the compiler time to search for
them.
Second, and probably more importantly, any new refinements become part of the
Emboss contract with end users, forever: `$upper_bound()` must never return a
larger value than it did in previous versions, and `$lower_bound()` must never
return a smaller value. Reduced alignment bounds can cause performance
regressions in the highest-performance code. This has direct implications for
maintenance of the current Emboss compiler, but it also makes it even harder
for any third-party compiler to fully implement Emboss — which, in turn, means
that Emboss is even more likely to be a single-vendor language, like Python or
Ruby, instead of a multi-vendor language like C++ or JavaScript.
#### Front End Attribute Normalization + Verification
[*`attribute_checker.py`*](../compiler/front_end/attribute_checker.py)
This stage checks the attributes (annotations like `[byte_order:
"LittleEndian"]`) that are used by the Emboss front end, and adds missing
attributes where needed.
Attributes were originally intended to be a way to add functionality without
having to add new syntax, so their semantics are not conceptually cohesive.
Nonetheless, they are all handled together in the same stage.
This stage checks that all attributes have the correct argument type (no
`[byte_order: 10]`) and that their values are valid (no `[byte_order:
"scrambled"]`). `$default`'ed attributes have their values copied into every
place in the IR where they make sense and were not overridden (so `[$default
byte_order: "BigEndian"]` will cause `[byte_order: "BigEndian"]` to be attached
to most fields that do not already have a `[byte_order]` attribute).
#### Miscellaneous Constraint Checking
[*`constraints.py`*](../compiler/front_end/constraints.py)
Even more conceptually incoherent than the previous stage, constraint checking
verifies that certain things are true about the IR:
* Fields in `bits` have types that are allowed in `bits`.
* Arrays have constant sizes for all dimensions other than the last one,
which may be empty.
* Arrays in `struct` are arrays of types that are a whole number of bytes
long (no `UInt:4[]`).
* In `T[]`, `T` has a constant size.
* Fields with `external` types like `Int`, `Flag`, etc. meet the size
requirements for those types.
* Enum, field, and type names are not reserved words in any existing
mainstream programming language as of 2014, even if they are not Emboss
keywords. For example, no field may be named `class`, even though Emboss
itself does not care.
* Constant references (references of the form `Type.field`) are actually
constant.
* All enum values can be represented as 64-bit unsigned or signed
2's-complement integers (or narrower, if the `enum` has an explicit,
narrower width).
* All expression values *that might be computed at runtime* fit within
64-bit unsigned or signed 2's-complement integers.
The common theme is that none of these checks are needed by the front end
itself: they exist to simplify the back end by reducing the number of cases
that have to be handled. Most of these constraints could be lifted at some
point.
#### Inferred Write Method Generation
[*`write_inference.py`*](../compiler/front_end/write_inference.py)
This stage — the last before the IR goes to the back end — figures out how to
write through virtual fields. There are three possibilities:
If the virtual field is a direct alias (`let field = some_other_field`), then
writing through `field` is exactly the same as writing through
`some_other_field`. In fact, it is *so* exactly the same that the C++ back end
handles this by making the `field()` method call just return the view of
`some_other_field()` directly. This works even if `some_other_field` is
actually a nested field, like `some.other.field`.
This method is most important for anonymous `bits`: end users can only write
through the aliases, because the original field is not accessible to them.
If the virtual field is a series of additions and subtractions on exactly one
other field, like `let f_plus_10 = f + 10`, then a special "write" expression
will be generated that inverts the calculation: in this case, `f_plus_10 - 10`.
It was the intention of the Emboss authors to expand this inference over time,
but (currently) addition and subtraction are the only fully-invertible
operations in the Emboss expression language.
Lastly, if neither of the other options apply, the field is read-only.
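A sketch of the inversion for the second case, using tuples in place of real
IR expressions and `value` as the placeholder for the value being written (as
in the attribute example below):

```python
# Illustrative only: invert `let v = f + 10` into the write expression
# `value - 10`, where `value` is the value being written to `v`.  Tuples of
# the form ("op", subexpression, constant) stand in for IR expressions.
def invert(expression, value="value"):
    """Returns (field_to_write, expression_for_the_new_field_value)."""
    if isinstance(expression, str):
        return expression, value        # plain alias: write straight through
    operator, inner, constant = expression
    inverse = "-" if operator == "+" else "+"
    return invert(inner, (inverse, value, constant))

print(invert(("+", "f", 10)))
# ('f', ('-', 'value', 10))
print(invert(("-", ("+", "f", 10), 3)))     # let v = (f + 10) - 3
# ('f', ('-', ('+', 'value', 3), 10))
```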
It was also the intention to open this feature up to users by allowing "write
method" attributes to be attached to virtual fields, like:
```emb
# An example from an actual device, where `x` is a frequency expressed in
# either Hz or MHz.
let f = x < 1000 ? x * 1000000 : x
  [write_field: x]
  [write_expression: value]
```
## C++ Back End
[*`compiler/back_end/cpp`*](../compiler/back_end/cpp)
The C++ back end transforms the fully elaborated IR into a header file.
### C++ Back-End Attribute Verification
[*`attributes.py`*](../compiler/back_end/cpp/attributes.py)
[*`header_generator.py`*](../compiler/back_end/cpp/header_generator.py)
The first stage of the C++ back end is a short one: it verifies that the
`[(cpp) namespace]` attribute, if present, is a valid C++ namespace, and that
any `[(cpp) enum_case]` attributes are valid.
### C++ Header Generation
[*`header_generator.py`*](../compiler/back_end/cpp/header_generator.py)
The very last stage walks the IR and produces the text of a C++ `.h` file.
Unlike most stages, C++ header generation does *not* use
`fast_traverse_ir_top_down()`: the code to walk the tree is manual, generally
assembling fragments of C++ code from the bottom up.
During most steps of this traversal, the C++ back end builds up three separate
sections: a forward declaration section, a class definition section, and then a
method definition section, which are eventually emitted in that specific order.
This structure lets any circular references (e.g., Emboss structure types that
mutually contain each other) work without the back end having to track anything
special: by the time a type is referenced, it will always have been declared,
and by the time a type is actually *used* it will always have been defined.
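Schematically, each generated type appends to all three sections, and the
sections are concatenated at the very end; a trivial sketch with stand-in
content (the real `header_generator.py` emits far more, of course):

```python
# Illustrative only: the three output sections built up during generation
# and concatenated at the end.  The generated C++ here is a stand-in.
forward_declarations, class_definitions, method_definitions = [], [], []

def generate_structure(name):
    forward_declarations.append(f"class {name}View;")
    class_definitions.append(
        f"class {name}View {{\n public:\n  bool Ok() const;\n}};")
    method_definitions.append(
        f"inline bool {name}View::Ok() const {{ return true; }}")

for structure in ("Foo", "Bar"):
    generate_structure(structure)

header = "\n\n".join(["\n".join(forward_declarations),
                      "\n".join(class_definitions),
                      "\n".join(method_definitions)])
print(header)
```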
This stage has the most code of any single stage of the Emboss compiler, so it
can be difficult to follow. (It is also some of the oldest code in the
compiler, which does not help.)