# Design of the Emboss Compiler (`embossc`)
This document describes the internals of the Emboss compiler. End users do not
need to read this document.
## Overall Design
The Emboss compiler follows a reasonably standard compiler design, where the
input source text is first converted to an *intermediate representation* (IR),
then various operations are performed on the IR, and finally the IR is used to
construct the final output — at the time of writing, C++ source code:
```mermaid
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
diskstart@{ shape: doc, label: "example.emb" }
parser["Parser"]
processing@{ shape: procs, label: "IR processing" }
backend["C++ Code Generator"]
diskend@{ shape: doc, label: "example.emb.h" }
diskstart-- ".emb" -->parser
parser-- IR -->processing
processing-- IR -->backend
backend-- "C++" -->diskend
```
Currently, the Emboss compiler is split into three programs: the *front end*,
which parses the input and does almost all of the IR processing; the *C++
back end*, which does a minimal amount of C++-specific IR processing and
generates the final C++ code; and the `embossc` driver, which combines the
front end and back end. This split makes it straightforward to add new back
ends later:
```mermaid
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
diskstart@{ shape: doc, label: "example.emb" }
parser["Parser"]
processing@{ shape: procs, label: "IR processing" }
cppbackend["C++ Code Generator"]
cppdiskend@{ shape: doc, label: "example.emb.h" }
rustbackend["Rust Code Generator"]
rustdiskend@{ shape: doc, label: "example.emb.rs" }
mdbackend["Documentation Generator"]
mddiskend@{ shape: doc, label: "example.emb.md" }
diskstart-- ".emb" -->parser
parser-- IR -->processing
processing-- IR -->cppbackend
cppbackend-- "C++" -->cppdiskend
processing-- IR -->rustbackend
rustbackend-- Rust -->rustdiskend
processing-- IR -->mdbackend
mdbackend-- Markdown -->mddiskend
```
### IR (Intermediate Representation)
"IR" is a general term: each compiler has its own IR. The Emboss IR is a tree,
with node types defined in [compiler/util/ir_data.py][ir_data_py].
[ir_data_py]: ../compiler/util/ir_data.py
The first stage of the compiler — the parser — generates an "initial" IR, which
contains only information that is directly available in the source text. Even
so, the initial IR can be quite large for a very short `.emb` file. For
example, this line:
```emb
0 [+4] UInt file_state
```
turns into this IR (in JSON serialization) immediately after parsing:
```json
{
  "location": {
    "start": {
      "constant": {
        "value": "0",
        "source_location": "22:3-22:4"
      },
      "source_location": "22:3-22:4"
    },
    "size": {
      "constant": {
        "value": "4",
        "source_location": "22:8-22:9"
      },
      "source_location": "22:8-22:9"
    },
    "source_location": "22:3-22:10"
  },
  "type": {
    "atomic_type": {
      "reference": {
        "source_name": [
          {
            "text": "UInt",
            "source_location": "22:13-22:17"
          }
        ],
        "source_location": "22:13-22:17"
      },
      "source_location": "22:13-22:17"
    },
    "source_location": "22:13-22:17"
  },
  "name": {
    "name": {
      "text": "file_state",
      "source_location": "22:25-22:35"
    },
    "source_location": "22:25-22:35"
  },
  "existence_condition": {
    "boolean_constant": {
      "value": true,
      "source_location": "22:3-22:35"
    },
    "source_location": "22:3-22:35"
  },
  "source_location": "22:3-22:35"
}
```
In graphical form (with `source_location` nodes omitted for clarity):
```mermaid
graph TD
n0@{ shape: diamond, label: "field" }
n0 --> n1
n0 --> n4
n0 --> n9
n0 --> n10
n1@{ shape: diamond, label: "location" }
n1 --> n2
n1 --> n3
n2@{ shape: diamond, label: "start" }
n2 --> l0
n3@{ shape: diamond, label: "size" }
n3 --> l1
n4@{ shape: diamond, label: "type" }
n4 --> n5
n5@{ shape: diamond, label: "atomic_type" }
n5 --> n6
n6@{ shape: diamond, label: "reference" }
n6 --> n7
n7@{ shape: diamond, label: "source_names" }
n7 --> n8
n8@{ shape: diamond, label: "source_name" }
n8 --> l2
n9@{ shape: diamond, label: "name" }
n9 --> l3
n10@{ shape: diamond, label: "existence_condition" }
n10 --> l4
l0@{ shape: rect, label: "constant: 0" }
l1@{ shape: rect, label: "constant: 4" }
l2@{ shape: rect, label: "text: UInt" }
l3@{ shape: rect, label: "name: file_state" }
l4@{ shape: rect, label: "boolean_constant: True" }
```
This initial IR then goes through a series of *elaborations*, which annotate
the IR, and *validations*, which check various properties of the IR. These are implemented as *stages* in the compiler:
```mermaid
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
diskstart@{ shape: doc, label: "example.emb" }
parser["Parser"]
stage1@{ shape: proc, label: "Stage 1" }
stage2@{ shape: proc, label: "Stage 2" }
stages@{ shape: text, label: "..." }
stagenm1@{ shape: proc, label: "Stage N-1" }
stagen@{ shape: proc, label: "Stage N" }
backend["C++ Code Generator"]
diskend@{ shape: doc, label: "example.emb.h" }
diskstart-->parser
parser-->stage1
stage1-->stage2
stage2-->stages
stages-->stagenm1
stagenm1-->stagen
stagen-->backend
backend-->diskend
```
In many cases, elaborations and validations are mixed together — for example,
in the symbol resolution stage, names in the IR (`field`) are *elaborated* with
the absolute symbol to which they resolve (`module.Type.field`), and, at the
same time, the symbol resolver *validates* that every name resolves to exactly
one absolute symbol.
At the end of this process, the IR is much larger:
```mermaid
graph TD
n0@{ shape: diamond, label: "field" }
n0 --> n1
n0 --> n8
n0 --> n15
n0 --> n16
n0 --> n19
n0 --> n22
n1@{ shape: diamond, label: "location" }
n1 --> n2
n1 --> n5
n2@{ shape: diamond, label: "start" }
n2 --> l0
n2 --> n3
n3@{ shape: diamond, label: "type" }
n3 --> n4
n4@{ shape: diamond, label: "integer" }
n4 --> l1
n4 --> l2
n4 --> l3
n4 --> l4
n5@{ shape: diamond, label: "size" }
n5 --> l5
n5 --> n6
n6@{ shape: diamond, label: "type" }
n6 --> n7
n7@{ shape: diamond, label: "integer" }
n7 --> l6
n7 --> l7
n7 --> l8
n7 --> l9
n8@{ shape: diamond, label: "type" }
n8 --> n9
n9@{ shape: diamond, label: "atomic_type" }
n9 --> n10
n10@{ shape: diamond, label: "reference" }
n10 --> n11
n10 --> n13
n11@{ shape: diamond, label: "canonical_name" }
n11 --> l10
n11 --> n12
n12@{ shape: diamond, label: "object_paths" }
n12 --> l11
n13@{ shape: diamond, label: "source_names" }
n13 --> n14
n14@{ shape: diamond, label: "source_name" }
n14 --> l12
n15@{ shape: diamond, label: "write_method" }
n15 --> l13
n16@{ shape: diamond, label: "name" }
n16 --> l14
n16 --> n17
n17@{ shape: diamond, label: "canonical_name" }
n17 --> l15
n17 --> n18
n18@{ shape: diamond, label: "object_paths" }
n18 --> l16
n18 --> l17
n19@{ shape: diamond, label: "attributes" }
n19 --> n20
n20@{ shape: diamond, label: "attribute" }
n20 --> l18
n20 --> n21
n20 --> l20
n21@{ shape: diamond, label: "value" }
n21 --> l19
n22@{ shape: diamond, label: "existence_condition" }
n22 --> l21
n22 --> n23
n23@{ shape: diamond, label: "type" }
n23 --> l22
l0@{ shape: rect, label: "constant: 0" }
l1@{ shape: rect, label: "modulus: infinity" }
l2@{ shape: rect, label: "modular_value: 0" }
l3@{ shape: rect, label: "minimum_value: 0" }
l4@{ shape: rect, label: "maximum_value: 0" }
l5@{ shape: rect, label: "constant: 4" }
l6@{ shape: rect, label: "modulus: infinity" }
l7@{ shape: rect, label: "modular_value: 4" }
l8@{ shape: rect, label: "minimum_value: 4" }
l9@{ shape: rect, label: "maximum_value: 4" }
l10@{ shape: rect, label: "module_file: " }
l11@{ shape: rect, label: "object_path: UInt" }
l12@{ shape: rect, label: "text: UInt" }
l13@{ shape: rect, label: "physical: True" }
l14@{ shape: rect, label: "name: file_state" }
l15@{ shape: rect, label: "module_file: testdata/golden/span" }
l16@{ shape: rect, label: "object_path: LogFileStatus" }
l17@{ shape: rect, label: "object_path: file_state" }
l18@{ shape: rect, label: "name: byte_order" }
l19@{ shape: rect, label: "string_constant: LittleEndian" }
l20@{ shape: rect, label: "is_default: False" }
l21@{ shape: rect, label: "boolean_constant: True" }
l22@{ shape: rect, label: "boolean: True" }
```
### Front End vs Back End(s)
The Emboss compiler is divided into a front end, which does most of the work,
and back ends, which do language-specific validations and translate the final
IR to the final output format. Currently, only a C++ back end exists.
The compiler is structured so that the front end and back end can run as
separate programs, and when building with [Bazel][bazel] they do run
separately. For efficiency, the [`embossc`][embossc_source] driver just
imports the front end and C++ back end directly, so that it can skip the JSON
serialization and deserialization steps.
[bazel]: https://bazel.build/
[embossc_source]: ../embossc
The front end consists of (as of the time of writing) 14 stages:
1. Tokenization
2. Parse Tree Generation
3. Parse Tree → IR
4. Desugaring + Built-In Field Synthesis
5. Symbol Resolution Part 1: Head Symbols
6. Dependency Cycle Checking
7. Dependency Order Computation
8. Symbol Resolution Part 2: Field Access
9. Type Annotation
10. Type Checking
11. Bounds Computation
12. Front End Attribute Normalization + Verification
13. Miscellaneous Constraint Checking
14. Inferred Write Method Generation
Each of these stages will be explained in more detail later in this document.
The C++ back end is much simpler, with only 2 stages:
1. Back-End Attribute Normalization + Verification
2. C++ Header Generation
This lopsidedness is intentional: work done in the front end can be shared by
all back ends, but work in a back end only benefits that specific back end.
### IR Traversal
Most stages walk the IR one (or more) times, processing or checking specific
types of nodes. The Emboss compiler uses a special
[`fast_traverse_ir_top_down()`][traverse_ir] function that takes a *pattern*
and calls a handler on all nodes that match that pattern.
`fast_traverse_ir_top_down()` has been optimized to walk trees efficiently: it
will skip branches if no node in the branch could possibly match the pattern.
[traverse_ir]: ../compiler/util/traverse_ir.py#:~:def%20fast_traverse_ir_top_down
For example, this will call `print()` on every `Word` in the IR:
```python
fast_traverse_ir_top_down(ir, [ir_data.Word], print)
```
Or you could limit the scope to only `Word` nodes inside of `Import` nodes:
```python
fast_traverse_ir_top_down(ir, [ir_data.Import, ir_data.Word], print)
```
`fast_traverse_ir_top_down()` has a lot of options. For full usage, see
[its documentation in the source code][traverse_ir].
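To make the branch-skipping idea concrete, here is a minimal sketch of a
pattern-directed traversal. The node shapes and the `_CAN_CONTAIN` table are
made up for this example, and the real `traverse_ir.py` implementation differs
in detail:

```python
# Illustrative sketch only: a pattern-directed, top-down traversal with
# branch skipping.  The node dictionaries and the _CAN_CONTAIN table are
# invented for this example; the real code works on ir_data objects.
_CAN_CONTAIN = {
    # For each node type, the node types that can appear anywhere beneath it.
    "Module": {"Import", "Struct", "Field", "Word"},
    "Import": {"Word"},
    "Struct": {"Field", "Word"},
    "Field": {"Word"},
    "Word": set(),
}

def traverse(node, pattern, action):
    """Calls `action` on nodes reached by matching `pattern`, top down."""
    remaining = pattern[1:] if node["type"] == pattern[0] else pattern
    if not remaining:
        action(node)
        return
    # Branch skipping: if nothing below this node could match the next
    # pattern element, do not descend at all.
    if remaining[0] not in _CAN_CONTAIN[node["type"]]:
        return
    for child in node.get("children", []):
        traverse(child, remaining, action)

module = {"type": "Module", "children": [
    {"type": "Import", "children": [{"type": "Word", "text": "other.emb"}]},
    {"type": "Struct", "children": [{"type": "Word", "text": "Foo"}]},
]}
# Prints only the Word inside the Import; the Struct subtree is skipped.
traverse(module, ["Import", "Word"], print)
```

The key point is the early `return`: because (in this toy table) a `Struct`
cannot contain an `Import`, the entire `Struct` subtree is never visited when
matching `[Import, Word]`.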
### Error Handling
The general scheme for running stages is:
```python
def run_stages(stages, prev_ir):
    for stage in stages:
        next_ir, errors = stage(prev_ir)
        if errors:
            return None, errors
        prev_ir = next_ir
    return prev_ir, None
```
That is: run each stage, and if the stage returns any errors, stop processing
and return the error(s).
Errors are represented by lists of [messages][error_py] (`error`, `warning` or
`note`), where, generally, the first message will be an `error` or `warning`,
and any subsequent messages are `note` messages that provide additional
information.
[error_py]: ../compiler/util/error.py
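As a rough illustration only (the real message objects and helper functions
live in `error.py` and are not reproduced here), one returned error might have
a shape along these lines:

```python
# Illustrative only: the rough shape of one "error", a list of messages in
# which the first entry is the error or warning and the rest are notes.
# The real message objects are defined in compiler/util/error.py.
one_error = [
    {"severity": "error",
     "source_file": "example.emb",
     "location": "2:5",
     "message": "Start of field must be an integer."},
    {"severity": "note",
     "source_file": "example.emb",
     "location": "1:1",
     "message": "Field is part of this structure."},
]
# On failure, a stage returns something like (None, [one_error, ...]).
```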
#### Errors on Synthetic Nodes
One important caveat here is that errors on *synthetic* IR nodes — nodes that
did not come directly from the input `.emb` — are deferred: the compiler will
continue processing to try to find an error in a "natural" node to display to
the user. This is covered in more detail in [Built-In Field
Synthesis](#desugaring).
## Front End
[*`compiler/front_end/...`*](../compiler/front_end/)
The front end is responsible for reading in Emboss definitions and producing a
normalized intermediate representation (IR). It is divided into several steps:
roughly, parsing, import resolution, symbol resolution, and validation.
The front end is orchestrated by [`glue.py`][glue_py], which runs each front
end component in the proper order to construct an IR suitable for consumption
by the back end.
[glue_py]: ../compiler/front_end/glue.py
The front end driver program is [`emboss_front_end.py`][emboss_front_end_py],
which calls `glue.ParseEmbossFile` and prints the results.
[emboss_front_end_py]: ../compiler/front_end/emboss_front_end.py
### Tokenization and Parsing
The first part of the front end translates the text of the `.emb` input into an
IR structure.
#### Tokenization
[*`tokenizer.py`*](../compiler/front_end/tokenizer.py)
The very first stage is tokenization, also called lexical analysis or lexing.
Tokenization breaks the input text into *tokens* of one or more characters.
For example, take the string `abc+def`:
```
+---+---+---+---+---+---+---+
| a | b | c | + | d | e | f |
+---+---+---+---+---+---+---+
```
This will be grouped into three tokens, `abc`, `+`, and `def`:
```
+-----------+-----+-----------+
| abc       | +   | def       |
| SnakeWord | "+" | SnakeWord |
+-----------+-----+-----------+
```
In addition, each token is labeled with a category, such as `Documentation`,
`Number`, or `"*"`: by convention, tokens that are always a specific sequence
of characters are given a label that is just the token in double quotes, such
as `"+"` for `+`, `"=="` for `==`, `"if"` for `if`, and so on.
These category labels match names in the grammar in the next step.
There is one extra bit of logic in the Emboss tokenizer, which mimics CPython:
indentation is tracked and translated into special `Indent` and `Dedent`
tokens when it changes. This fragment:
```emb
struct Foo:
  -- word
```
will be tokenized as:
```
+----------+-----------+-----+------+--------+---------------+--------+
| struct   | Foo       | :   | \n   |        | -- word       |        |
| "struct" | CamelWord | ":" | "\n" | Indent | Documentation | Dedent |
+----------+-----------+-----+------+--------+---------------+--------+
```
#### Parse Tree Generation
[*`lr1.py`*](../compiler/front_end/lr1.py)
The list of tokens is then passed to a parser, which produces a parse tree.
The point of this stage is to take the list of tokens, and apply a tree
structure on top of it. Starting with the tokens from the previous example:
```mermaid
graph LR
colon@{label: "\":\""}
newline@{label: "\"\\n\""}
struct ~~~ Foo ~~~ colon ~~~ newline ~~~ Indent ~~~ Documentation ~~~ Dedent
```
the parser infers the parse tree:
```mermaid
graph TD
r0@{shape: diamond, label: "module"}
r1@{shape: diamond, label: "comment-line*"}
r2@{shape: diamond, label: "doc-line*"}
r3@{shape: diamond, label: "import-line*"}
r4@{shape: diamond, label: "attribute-line*"}
r5@{shape: diamond, label: "type-definition*"}
r6@{shape: diamond, label: "type-definition"}
r7@{shape: diamond, label: "struct"}
r8@{shape: diamond, label: "type-name"}
r9@{shape: diamond, label: "type-word"}
r10@{shape: diamond, label: "delimited-parameter-definition-list?"}
r11@{shape: diamond, label: "Comment?"}
r12@{shape: diamond, label: "eol"}
r13@{shape: diamond, label: "comment-line*"}
r14@{shape: diamond, label: "struct-body"}
r15@{shape: diamond, label: "doc-line*"}
r16@{shape: diamond, label: "doc-line"}
r17@{shape: diamond, label: "doc"}
r18@{shape: diamond, label: "Comment?"}
r19@{shape: diamond, label: "eol"}
r20@{shape: diamond, label: "comment-line*"}
r21@{shape: diamond, label: "doc-line*"}
r22@{shape: diamond, label: "attribute-line*"}
r23@{shape: diamond, label: "type-definition*"}
r24@{shape: diamond, label: "struct-field-block"}
r25@{shape: diamond, label: "type-definition*"}
subgraph tokens
direction LR
t0@{shape: rect, label: "struct"}
t1@{shape: rect, label: "Foo"}
t2@{shape: rect, label: ":"}
t3@{shape: rect, label: "\\n"}
t4@{shape: rect, label: "Indent"}
t5@{shape: rect, label: "-- word"}
t6@{shape: rect, label: "\\n"}
t7@{shape: rect, label: "Dedent"}
end
r0 --> r1
r0 --> r2
r0 --> r3
r0 --> r4
r0 --> r5
r5 --> r6
r5 --> r25
r6 --> r7
r7 ----> t0
r7 --> r8
r7 --> r10
r7 ----> t2
r7 --> r11
r7 --> r12
r7 --> r14
r8 --> r9
r9 ----> t1
r12 ----> t3
r12 --> r13
r14 ----> t4
r14 --> r15
r14 --> r22
r14 --> r23
r14 --> r24
r14 ----> t7
r15 --> r16
r15 --> r21
r16 --> r17
r16 --> r18
r16 --> r19
r17 ----> t5
r19 ----> t6
r19 --> r20
```
Parsing is a large subject with many good tutorials and references available,
so only the `embossc` specifics are covered here.
The Emboss compiler uses a [shift-reduce parser][shift_reduce] (specifically, a
canonical LR(1) parser), built on a [custom engine][lr1_py] that has some
features that are not available in most other shift-reduce engines — and also
some limitations.
[shift_reduce]: https://en.wikipedia.org/wiki/Shift-reduce_parser
[lr1_py]: ../compiler/front_end/lr1.py
The biggest limitation is that the table generator only implements the
[canonical LR(1)][canonical_lr1] table generation algorithm, which means that
there is no "side channel" way of specifying precedence or otherwise resolving
conflicts — shift/reduce and reduce/reduce conflicts must be resolved by
removing them from the grammar. (However, Emboss's grammar uses
partially-ordered precedence, which — as far as the author is aware — is not
supported by any off-the-shelf precedence system.)
[canonical_lr1]: https://doi.org/10.1016/S0019-9958(65)90426-2
The most notable unusual feature is the incorporation of a [*Merr*][merr]-like
system for generating error messages.
[merr]: https://doi.org/10.1145/937563.937566
The output from this stage is a *parse tree*, not an IR.
#### Parse Tree → IR
[*`module_ir.py`*](../compiler/front_end/module_ir.py)
Finally, the parse tree is transformed into an IR. This is mostly a
straightforward translation, but there are a few places where the parse tree is
noticeably different from the IR. For example, in the parse tree, lists are
explicit *cons* lists, like:
```mermaid
graph TD
r1@{label: "struct-field-block"}
r2@{label: "struct-field*"}
r3@{label: "struct-field"}
r4@{label: "struct-field*"}
r5@{label: "struct-field"}
r6@{label: "struct-field*"}
r7@{label: "struct-field"}
r8@{label: "struct-field*"}
r1 --> r2
r2 --> r3
r2 --> r4
r4 --> r5
r4 --> r6
r6 --> r7
r6 --> r8
```
while in the IR these lists are flattened:
```mermaid
graph TD
n0@{label: "structure"}
n1@{label: "fields"}
n2@{label: "field"}
n3@{label: "field"}
n4@{label: "field"}
n0 --> n1
n1 --> n2
n1 --> n3
n1 --> n4
```
One bit of pure syntax sugar that is handled at this stage is chained
comparisons: `a == b == c` will be translated to the IR equivalent of `a == b
&& b == c`.
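As an illustration of that rewrite (using throwaway tuples rather than real IR
nodes):

```python
# Illustrative only: desugar a chained comparison like a == b == c into
# (a == b) && (b == c), using ("op", left, right) tuples in place of IR nodes.
def desugar_chain(operands, operators):
    comparisons = [(op, left, right) for op, left, right
                   in zip(operators, operands, operands[1:])]
    result = comparisons[0]
    for comparison in comparisons[1:]:
        result = ("&&", result, comparison)
    return result

print(desugar_chain(["a", "b", "c"], ["==", "=="]))
# ('&&', ('==', 'a', 'b'), ('==', 'b', 'c'))
```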
### IR Processing
<a name="desugaring"></a>
#### Desugaring + Built-In Field Synthesis
[*`synthetics.py`*](../compiler/front_end/synthetics.py)
The first stage that operates on the IR adds built-in fields like
`$size_in_bytes` and rewrites some constructs to a more regular form.
Specifically, anonymous `bits` will be rewritten into non-anonymous `bits`, and
`$next` will be replaced with the offset + size of the previous field.
Starting with this:
```emb
struct Foo:
  0 [+4] bits:
    0 [+1] Flag low
    31 [+1] Flag high
  if low:
    $next [+4] UInt field
```
`synthetics.py` will rewrite the IR into (the equivalent of):
```emb
struct Foo:
  bits EmbossReservedAnonymous0:
    [text_output: "Skip"]
    0 [+1] Flag low
    31 [+1] Flag high
  let $size_in_bytes = $max(0, true ? 0+4 : 0, low ? (0+4)+4 : 0)
  let $max_size_in_bytes = $upper_bound($size_in_bytes)
  let $min_size_in_bytes = $lower_bound($size_in_bytes)
  0 [+4] EmbossReservedAnonymous0 emboss_reserved_anonymous_1
  if low:
    0+4 [+4] UInt field
  let low = emboss_reserved_anonymous_1.low
  let high = emboss_reserved_anonymous_1.high
```
One important detail is that the newly-created nodes are marked as *synthetic*.
This makes a difference for error handling: because of the way that some of the
synthetic nodes are constructed from copied pieces of existing IR, errors can
sometimes be detected in synthetic nodes before they are detected in the
original source nodes they were copied from. These errors tend to be very
confusing to end users, since they reference parts of the IR that were not
entered by an end user. For example, the incorrect:
```emb
struct Foo:
  false [+4] UInt field
```
will have a synthesized `$size_in_bytes` field:
```emb
struct Foo:
  let $size_in_bytes = $max(0, true ? false+4 : 0)
  false [+4] UInt field
```
The erroneous expression `false+4` will be detected before the top-level type
error of using `false` as the field offset, with a message like:
```
example.emb:2:5: error: Left argument of operator '+' must be an integer.
  false [+4] UInt field
  ^^^^^
```
To avoid confusing end users, any error with a synthetic location will be
*deferred*, and only shown to an end user if no non-synthetic errors are
encountered during IR processing. In this case, the correct error message will
be found and displayed to the end user:
```
example.emb:2:5: error: Start of field must be an integer.
  false [+4] UInt field
  ^^^^^
```
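Conceptually, the deferral is just a filter over the collected errors. A rough
sketch, assuming each error is tagged with an `is_synthetic` flag (a
simplification of how locations are actually marked in the IR):

```python
# Illustrative only: show errors at synthetic locations only when no error at
# a "natural" location was found.
def filter_synthetic_errors(errors):
    natural = [e for e in errors if not e["is_synthetic"]]
    return natural if natural else errors

errors = [
    {"is_synthetic": True,
     "message": "Left argument of operator '+' must be an integer."},
    {"is_synthetic": False,
     "message": "Start of field must be an integer."},
]
print(filter_synthetic_errors(errors)[0]["message"])
# Start of field must be an integer.
```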
#### Symbol Resolution Part 1: Head Symbols
[*`symbol_resolver.py`*](../compiler/front_end/symbol_resolver.py)
After desugaring, the first part of symbol resolution occurs: resolving *head*
symbols. Head symbols are any symbols that do not follow a `.`: for example,
`head` or `Head` in `head`, `Head.field` and `head.Type.field`. Resolution of
symbols after `.` happens in a later stage, after some other checking.
Head symbol resolution follows lexical scoping rules: a symbol may refer to
entities in a set of *scopes* determined by its position in the source text.
Head symbol resolution is straightforward: while walking down the IR tree,
build up a list of available scopes. When a symbol is encountered, check to
see if that symbol is *defined* in any of the available scopes. If it is
defined in *exactly one* available scope, bind the symbol to that definition.
If it is not defined, raise an error. Emboss takes the unusual step of *also*
raising an error if the symbol is defined in two or more scopes: in most
computer languages, there is some scope precedence, and the highest-precedence
definition is used. However, this can lead to surprising, and in some cases
*silently incorrect* results when the desired definition is *shadowed* by an
unknown-to-the-user higher-precedence definition, so Emboss raises an error and
requires the user to explicitly pick the correct definition.
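A minimal sketch of that resolution rule, with plain dictionaries standing in
for the real scope objects in `symbol_resolver.py`:

```python
# Illustrative only: resolve a head symbol against every visible scope and
# insist on exactly one definition.  Real scopes, errors, and IR updates
# live in symbol_resolver.py.
def resolve_head_symbol(name, scopes):
    """scopes: outermost-first list of {name: definition} dictionaries."""
    matches = [scope[name] for scope in scopes if name in scope]
    if not matches:
        raise ValueError(f"No candidate for '{name}'.")
    if len(matches) > 1:
        raise ValueError(f"Ambiguous name '{name}': multiple definitions "
                         "are visible; qualify the name explicitly.")
    return matches[0]

module_scope = {"Foo": "my_module.Foo"}
struct_scope = {"field_a": "my_module.Foo.field_a"}
print(resolve_head_symbol("field_a", [module_scope, struct_scope]))
# my_module.Foo.field_a
```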
#### Dependency Cycle Checking
[*`dependency_checker.py`*](../compiler/front_end/dependency_checker.py)
Once head symbols have been resolved, there is enough information in the IR to
check for cycles in the dependency graph: cases where `object_1` depends on
`object_2` and `object_2` also depends on `object_1`, sometimes through a chain
of multiple dependencies.
The Emboss compiler accomplishes this by first building up the complete
dependency graph of the IR, and then running a graph partitioning algorithm
([Tarjan's algorithm][tarjan], in this case) on the dependency graph to find
all strongly-connected components. If there are any partitions with more than
one node, or any nodes with a self edge, then there is a dependency cycle.
[tarjan]: https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm
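A compact sketch of that check, using a straightforward recursive
implementation of Tarjan's algorithm over an adjacency-list graph (the real
`dependency_checker.py` builds the graph from the IR and produces proper error
messages):

```python
# Illustrative only: find strongly-connected components with a recursive
# Tarjan's algorithm, then report any component with more than one node or
# with a self edge as a dependency cycle.
def strongly_connected_components(graph):
    index, lowlink, on_stack, stack, components = {}, {}, set(), [], []

    def visit(node):
        index[node] = lowlink[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for successor in graph[node]:
            if successor not in index:
                visit(successor)
                lowlink[node] = min(lowlink[node], lowlink[successor])
            elif successor in on_stack:
                lowlink[node] = min(lowlink[node], index[successor])
        if lowlink[node] == index[node]:
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components

def find_cycles(graph):
    return [c for c in strongly_connected_components(graph)
            if len(c) > 1 or any(node in graph[node] for node in c)]

# a -> b -> c -> a is a cycle; d depends on a but is not part of any cycle.
dependencies = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(find_cycles(dependencies))  # one cycle: {'a', 'b', 'c'} (in some order)
```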
#### Dependency Order Computation
[*`dependency_checker.py`*](../compiler/front_end/dependency_checker.py)
Every structure in the IR is also annotated with a 'dependency ordering': an
ordering of its fields such that any field *A* that depends on another field
*B* is later in the ordering than *B*.
This ordering is not used until the back end (where it is used to order fields
for text output), but it is convenient to build it here, where the dependency
graph has already been built.
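The ordering itself is a standard topological sort; a sketch (again over a
plain adjacency list rather than the real IR):

```python
# Illustrative only: order fields so that every field comes after all of the
# fields it depends on.  Assumes the cycle check above has already passed.
def dependency_order(dependencies):
    order, placed = [], set()

    def place(field):
        if field in placed:
            return
        placed.add(field)
        for dependency in dependencies.get(field, []):
            place(dependency)
        order.append(field)

    for field in dependencies:
        place(field)
    return order

# c depends on b, which depends on a.
print(dependency_order({"c": ["b"], "b": ["a"], "a": []}))  # ['a', 'b', 'c']
```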
#### Symbol Resolution Part 2: Field Access
[*`symbol_resolver.py`*](../compiler/front_end/symbol_resolver.py)
Next, "tail" symbols are resolved: any symbol immediately after `.`. These are
resolved after checking for dependency cycles because the resolution algorithm
can become stuck in infinite recursion if there is a dependency cycle.
#### Type Annotation
[*`type_check.py`*](../compiler/front_end/type_check.py)
With symbols resolved, it is possible to annotate expression types. The Emboss
compiler annotates *every* expression and subexpression with a type. At this
stage, those types are coarse: integer, enum, boolean, string, "opaque", and so
on.
This stage also checks expressions for *internal* type consistency. For
example, it will find the error in:
```emb
struct Foo:
  let a = 10 + true
```
but not in:
```emb
struct Foo:
  true [+4] UInt a
```
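A sketch of what coarse, bottom-up annotation with internal consistency
checking looks like, using tuples and Python values in place of IR expression
nodes (the real rules live in `type_check.py`):

```python
# Illustrative only: assign a coarse type to every subexpression, bottom up,
# and flag operator/operand mismatches.  Tuples stand in for IR expressions.
def annotate(expression):
    if isinstance(expression, bool):   # bool before int: True is an int too
        return "boolean"
    if isinstance(expression, int):
        return "integer"
    operator, left, right = expression
    left_type, right_type = annotate(left), annotate(right)
    if operator == "+":
        if (left_type, right_type) != ("integer", "integer"):
            raise TypeError("Arguments of '+' must be integers.")
        return "integer"
    if operator == "==":
        if left_type != right_type:
            raise TypeError("Arguments of '==' must have the same type.")
        return "boolean"
    raise NotImplementedError(operator)

print(annotate(("+", 10, 4)))     # integer
try:
    annotate(("+", 10, True))     # the error in `let a = 10 + true`
except TypeError as error:
    print(error)                  # Arguments of '+' must be integers.
```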
#### Type Checking
[*`type_check.py`*](../compiler/front_end/type_check.py)
After annotating all types, the compiler checks that expressions have the
correct top-level types: that is, that their types are correct for their
positions within the IR. This is the stage that detects the use of a boolean
as a field offset from the previous example:
```emb
struct Foo:
  true [+4] UInt a
```
#### Bounds Computation
[*`expression_bounds.py`*](../compiler/front_end/expression_bounds.py)
With coarse type checking done, the expression types can be *refined* into
narrower types: for example, instead of just "integer," "integer between 0 and
65535, inclusive."
Fully understanding this stage involves (just a tiny bit of!) Type Theory:
mainly, that a "type" in the computer science sense is just a *set* (in the
mathematical sense) of all the values that a variable of that type could
possibly hold. This sort of "type" is conflated with physical layout in most
industrial computing languages (including Emboss), but is conceptually
separate.
The boolean type is the set $\left\\{ \scriptstyle{FALSE}, \scriptstyle{TRUE}
\right\\}$, so its only refinements (nonempty proper subsets) are $\left\\{
\scriptstyle{FALSE} \right\\}$ and $\left\\{ \scriptstyle{TRUE} \right\\}$.
This stage will try to make those refinements, where possible.
The base integer type in the Emboss expression language is the set of all
integers, normally written as $\mathbb{Z}$. It is obviously impossible for any
real-world computer to have a variable of type $\mathbb{Z}$ — there is no
computer with infinite memory — so this type needs to be refined.
In Emboss, currently, integer types are refined into the form $\left\\{ x | x
\in \mathbb{Z} \wedge a \le x \le b \wedge x \equiv n \pmod m \right\\}$: that
is, $x$ is an integer ($x \in \mathbb{Z}$), $x$ is at least $a$ and at most $b$
($a \le x \le b$), and $x$ % $m$ == $n$ ($x \equiv n \pmod m$). The values $a$
and $b$ are lower and upper bounds, respectively, while $m$ and $n$ are used
for alignment tracking — $m$ tracks the "level" of alignment, and $n$ tracks
the "offset" within the alignment.
This integer type refinement starts with known types. A value read from an
`Int` or `UInt` field will have known bounds (based on the bit width of the
field), $m = 1$ and $n = 0$. A constant $c$ has $a = b = n = c$ and $m =
\infty$.
From this basis, the refinements for any operation can be determined. For
example, for `x + y`, the resulting lower and upper bounds are $a = a_x + a_y$
and $b = b_x + b_y$. The resulting alignment is a little more complex, involving
the greatest common divisor function.
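A sketch of the addition rule, with `None` standing in for the infinite
modulus of a constant (the real bookkeeping lives in `expression_bounds.py`):

```python
import math

# Illustrative only: propagate bounds and alignment through `x + y`.  Each
# refinement is (minimum, maximum, modulus, modular_value); modulus None
# stands in for the "infinity" modulus of a compile-time constant.
def add_bounds(x, y):
    x_min, x_max, x_mod, x_val = x
    y_min, y_max, y_mod, y_val = y
    if x_mod is None and y_mod is None:
        modulus, value = None, x_val + y_val        # constant + constant
    elif x_mod is None or y_mod is None:
        modulus = x_mod if y_mod is None else y_mod
        value = (x_val + y_val) % modulus
    else:
        modulus = math.gcd(x_mod, y_mod)
        value = (x_val + y_val) % modulus
    return (x_min + y_min, x_max + y_max, modulus, value)

# A 16-bit UInt field: bounds 0..65535, modulus 1, modular value 0.
field = (0, 65535, 1, 0)
# The constant 4: bounds 4..4, "infinite" modulus, modular value 4.
constant = (4, 4, None, 4)
print(add_bounds(field, constant))  # (4, 65539, 1, 0)
```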
The Emboss functions `$lower_bound(x)` and `$upper_bound(x)` are a little
strange: their result is a constant that is the lower or upper bound
(respectively) of their argument! That is, they extract a value from the type
system and return it as an expression. There aren't a lot of human-authored
uses of these functions, but they are used in the definitions of fields like
`$min_size_in_bytes` and `$max_size_in_bits`.
The refinements that Emboss finds are not always perfectly optimal, for various
reasons: for example, even if $x < 0$, `x * x` cannot be negative, but Emboss
does not try to detect any *cross-argument constraints*, so it will not figure
this out.
It might be tempting to add better refinement to Emboss, but there are
downsides. First, new refinements take up compiler time, which can contribute
to slow builds. Many possible refinements would only be found on a tiny
fraction of all expressions, so it is not worth the compiler time to search for
them.
Second, and probably more importantly, any new refinements become part of the
Emboss contract with end users, forever: `$upper_bound()` must never return a
larger value than it did in previous versions, and `$lower_bound()` must never
return a smaller value. Reduced alignment bounds can cause performance
regressions in the highest-performance code. This has direct implications for
maintenance of the current Emboss compiler, but it also makes it even harder
for any third-party compiler to fully implement Emboss — which, in turn, means
that Emboss is even more likely to be a single-vendor language, like Python or
Ruby, instead of a multi-vendor language like C++ or JavaScript.
#### Front End Attribute Normalization + Verification
[*`attribute_checker.py`*](../compiler/front_end/attribute_checker.py)
This stage checks the attributes (annotations like `[byte_order:
"LittleEndian"]`) that are used by the Emboss front end, and adds missing
attributes where needed.
Attributes were originally intended to be a way to add functionality without
having to add new syntax, so their semantics are not conceptually cohesive.
Nonetheless, they are all handled together in the same stage.
This stage checks that all attributes have the correct argument type (no
`[byte_order: 10]`) and that their values are valid (no `[byte_order:
"scrambled"]`). `$default`'ed attributes have their values copied into every
place in the IR where they make sense and were not overridden (so `[$default
byte_order: "BigEndian"]` will cause `[byte_order: "BigEndian"]` to be attached
to most fields that do not already have a `[byte_order]` attribute).
#### Miscellaneous Constraint Checking
[*`constraints.py`*](../compiler/front_end/constraints.py)
Even more conceptually incoherent than the previous stage, constraint checking
verifies that certain things are true about the IR:
* Fields in `bits` have types that are allowed in `bits`.
* Arrays have constant sizes for all dimensions other than the last one,
which may be empty.
* Arrays in `struct` are arrays of types that are a whole number of bytes
long (no `UInt:4[]`).
* In `T[]`, `T` has a constant size.
* Fields with `external` types like `Int`, `Flag`, etc. meet the size
requirements for those types.
* Enum, field, and type names are not reserved words in any existing
mainstream programming language as of 2014, even if they are not Emboss
keywords. For example, no field may be named `class`, even though Emboss
itself does not care.
* Constant references (references of the form `Type.field`) are actually
constant.
* All enum values can be represented as 64-bit unsigned or signed
2's-complement integers (or narrower, if the `enum` has an explicit,
narrower width).
* All expression values *that might be computed at runtime* fit within
64-bit unsigned or signed 2's-complement integers.
The common theme is that none of these checks are needed by the front end
itself: they exist to simplify the back end by reducing the number of cases
that have to be handled. Most of these constraints could be lifted at some
point.
#### Inferred Write Method Generation
[*`write_inference.py`*](../compiler/front_end/write_inference.py)
This stage — the last before the IR goes to the back end — figures out how to
write through virtual fields. There are three possibilities:
If the virtual field is a direct alias (`let field = some_other_field`), then
writing through `field` is exactly the same as writing through
`some_other_field`. In fact, it is *so* exactly the same that the C++ back end
handles this by making the `field()` method call just return the view of
`some_other_field()` directly. This works even if `some_other_field` is
actually a nested field, like `some.other.field`.
This method is most important for anonymous `bits`: end users can only write
through the aliases, because the original field is not accessible to them.
If the virtual field is a series of additions and subtractions on exactly one
other field, like `let f_plus_10 = f + 10`, then a special "write" expression
will be generated that inverts the calculation: in this case, `f_plus_10 - 10`.
It was the intention of the Emboss authors to expand this inference over time,
but (currently) addition and subtraction are the only fully-invertible
operations in the Emboss expression language.
Lastly, if neither of the other options apply, the field is read-only.
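A sketch of the inversion for the second case, using tuples in place of real
IR expressions and `value` as the placeholder for the value being written (as
in the attribute example below):

```python
# Illustrative only: invert `let v = f + 10` into the write expression
# `value - 10`, where `value` is the value being written to `v`.  Tuples of
# the form ("op", subexpression, constant) stand in for IR expressions.
def invert(expression, value="value"):
    """Returns (field_to_write, expression_for_the_new_field_value)."""
    if isinstance(expression, str):
        return expression, value        # plain alias: write straight through
    operator, inner, constant = expression
    inverse = "-" if operator == "+" else "+"
    return invert(inner, (inverse, value, constant))

print(invert(("+", "f", 10)))
# ('f', ('-', 'value', 10))
print(invert(("-", ("+", "f", 10), 3)))     # let v = (f + 10) - 3
# ('f', ('-', ('+', 'value', 3), 10))
```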
It was also the intention to open this feature up to users by allowing "write
method" attributes to be attached to virtual fields, like:
```emb
# An example from an actual device, where `x` is a frequency expressed in
# either Hz or MHz.
let f = x < 1000 ? x * 1000000 : x
  [write_field: x]
  [write_expression: value]
```
## C++ Back End
[*`compiler/back_end/cpp`*](../compiler/back_end/cpp)
The C++ back end transforms the fully elaborated IR into a header file.
### C++ Back-End Attribute Verification
[*`attributes.py`*](../compiler/back_end/cpp/attributes.py)
[*`header_generator.py`*](../compiler/back_end/cpp/header_generator.py)
The first stage of the C++ back end is a short one: it verifies that the
`[(cpp) namespace]` attribute, if present, is a valid C++ namespace, and that
any `[(cpp) enum_case]` attributes are valid.
### C++ Header Generation
[*`header_generator.py`*](../compiler/back_end/cpp/header_generator.py)
The very last stage walks the IR and produces the text of a C++ `.h` file.
Unlike most stages, C++ header generation does *not* use
`fast_traverse_ir_top_down()`: the code to walk the tree is manual, generally
assembling fragments of C++ code from the bottom up.
During most steps of this traversal, the C++ back end builds up three separate
sections: a forward declaration section, a class definition section, and then a
method definition section, which are eventually emitted in that specific order.
This structure lets any circular references (e.g., Emboss structure types that
mutually contain each other) work without the back end having to track anything
special: by the time a type is referenced, it will always have been declared,
and by the time a type is actually *used* it will always have been defined.
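Schematically, each generated type appends to all three sections, and the
sections are concatenated at the very end; a trivial sketch with stand-in
content (the real `header_generator.py` emits far more, of course):

```python
# Illustrative only: the three output sections built up during generation
# and concatenated at the end.  The generated C++ here is a stand-in.
forward_declarations, class_definitions, method_definitions = [], [], []

def generate_structure(name):
    forward_declarations.append(f"class {name}View;")
    class_definitions.append(
        f"class {name}View {{\n public:\n  bool Ok() const;\n}};")
    method_definitions.append(
        f"inline bool {name}View::Ok() const {{ return true; }}")

for structure in ("Foo", "Bar"):
    generate_structure(structure)

header = "\n\n".join(["\n".join(forward_declarations),
                      "\n".join(class_definitions),
                      "\n".join(method_definitions)])
print(header)
```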
This stage has the most code of any single stage of the Emboss compiler, so it
can be difficult to follow. (It is also some of the oldest code in the
compiler, which does not help.)