Ben Olmstead | c0d7784 | 2019-07-31 17:34:05 -0700 | [diff] [blame] | 1 | # Design of the Emboss Tool |
| 2 | |
| 3 | This document describes the internals of Emboss. End users do not need to read |
| 4 | this document. |
| 5 | |
| 6 | *TODO(bolms): Update this doc to include the newer passes.* |
| 7 | |
| 8 | The Emboss compiler is divided into separate "front end" and "back end" |
| 9 | programs. The front end parses Emboss files (`.emb` files) and produces a |
| 10 | stable intermediate representation (IR), which is consumed by the back ends. |
Ben Olmstead | 90ee0e8 | 2019-09-23 19:04:47 -0700 | [diff] [blame] | 11 | This IR is defined in [public/ir_pb2.py][ir_pb2_py]. |
Ben Olmstead | c0d7784 | 2019-07-31 17:34:05 -0700 | [diff] [blame] | 12 | |
Ben Olmstead | b3df29b | 2019-09-23 20:01:40 -0700 | [diff] [blame] | 13 | [ir_pb2_py]: public/ir_pb2.py |
Ben Olmstead | c0d7784 | 2019-07-31 17:34:05 -0700 | [diff] [blame] | 14 | |
| 15 | The back ends read the IR and emit code to view and manipulate Emboss-defined |
| 16 | data structures. Currently, only a C++ back-end exists. |
| 17 | |
| 18 | *TODO(bolms): Split the symbol resolution and validation steps in a separate |
| 19 | "middle" component, to allow external code generators to generate undecorated |
| 20 | Emboss IR instead of Emboss source text?* |
| 21 | |
| 22 | ## Front End |
| 23 | |
| 24 | *Implemented in [front_end/...][front_end]* |
| 25 | |
| 26 | [front_end]: front_end/ |
| 27 | |
| 28 | The front end is responsible for reading in Emboss definitions and producing a |
| 29 | normalized intermediate representation (IR). It is divided into several steps: |
| 30 | roughly, parsing, import resolution, symbol resolution, and validation. |
| 31 | |
| 32 | The front end is orchestrated by [glue.py][glue_py], which runs each front end |
| 33 | component in the proper order to construct an IR suitable for consumption by the |
| 34 | back end. |
| 35 | |
| 36 | [glue_py]: front_end/glue.py |
| 37 | |
| 38 | The actual driver program is [emboss_front_end.py][emboss_front_end_py], which |
| 39 | just calls `glue.ParseEmbossFile` and prints the results. |
| 40 | |
| 41 | [emboss_front_end_py]: front_end/emboss_front_end.py |
| 42 | |
| 43 | ### File Parsing |
| 44 | |
| 45 | Per-file parsing consumes the text of a single Emboss module, and produces an |
| 46 | "undecorated" IR for the module, containing only syntactic-level information |
| 47 | from the module. |
| 48 | |
| 49 | This "undecorated" IR is (almost) a subset of the final IR: later steps will add |
| 50 | information and perform validation, but will rarely remove anything from the IR |
| 51 | before it is emitted. |
| 52 | |
| 53 | #### Tokenization |
| 54 | |
| 55 | *Implemented in [tokenizer.py][tokenizer_py]* |
| 56 | |
| 57 | [tokenizer_py]: front_end/tokenizer.py |
| 58 | |
| 59 | The tokenizer is a fairly standard tokenizer, with Indent/Dedent insertion a la |
| 60 | Python. It divides source text into `parse_types.Symbol` objects, suitable for |
| 61 | feeding into the parser. |
| 62 | |
| 63 | #### Syntax Tree Generation |
| 64 | |
| 65 | *Implemented in [lr1.py][lr1_py] and [parser_generator.py][parser_generator_py], with a façade in [structure_parser.py][structure_parser_py]* |
| 66 | |
| 67 | [lr1_py]: front_end/lr1.py |
| 68 | [parser_generator_py]: front_end/parser_generator.py |
| 69 | [structure_parser_py]: front_end/structure_parser.py |
| 70 | |
| 71 | Emboss uses a pretty standard Shift-Reduce LR(1) parser. This is implemented in |
| 72 | three parts in Emboss: |
| 73 | |
| 74 | * A generic parser generator implementing the table generation algorithms from |
| 75 | *[Compilers: Principles, Techniques, & Tools][dragon_book]* and the |
| 76 | error-marking algorithm from *[Generating LR Syntax Error Messages from |
| 77 | Examples][jeffery_2003]*. |
| 78 | * An Emboss-specific parser builder which glues the Emboss tokenizer, grammar, |
| 79 | and error examples to the parser generator, producing an Emboss parser. |
| 80 | * The Emboss grammar, which is extracted from the file normalizer |
| 81 | (*[module_ir.py][module_ir_py]*). |
| 82 | |
| 83 | [dragon_book]: http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811 |
| 84 | [jeffery_2003]: http://dl.acm.org/citation.cfm?id=937566 |
| 85 | |
| 86 | #### Normalization |
| 87 | |
| 88 | *Implemented in [module_ir.py][module_ir_py]* |
| 89 | |
| 90 | [module_ir_py]: front_end/module_ir.py |
| 91 | |
| 92 | Once a parse tree has been generated, it is fed into a normalizer which |
| 93 | recursively turns the raw syntax tree into a "first stage" intermediate |
| 94 | representation (IR). The first stage IR serves to isolate later stages from |
| 95 | minor changes in the grammar, but only contains information from a single file, |
| 96 | and does not perform any semantic checking. |
| 97 | |
| 98 | ### Import Resolution |
| 99 | |
| 100 | *TODO(bolms): Implement imports.* |
| 101 | |
| 102 | After each file is parsed, any new imports it has are added to a work queue. |
| 103 | Each file in the work queue is parsed, potentially adding more imports to the |
| 104 | queue, until the queue is empty. |
| 105 | |
| 106 | ### Symbol Resolution |
| 107 | |
| 108 | *Implemented in [symbol_resolver.py][symbol_resolver_py]* |
| 109 | |
| 110 | [symbol_resolver_py]: front_end/symbol_resolver.py |
| 111 | |
| 112 | Symbol resolution is the process of correlating names in the IR. At the end of |
| 113 | symbol resolution, every named entity (type definition, field definition, enum |
| 114 | name, etc.) has a `CanonicalName`, and every reference in the IR has a |
| 115 | `Reference` to the entity to which it refers. |
| 116 | |
| 117 | This assignment occurs in two passes. First, the full IR is scanned, generating |
| 118 | scoped symbol tables (nested dictionaries of names to `CanonicalName`), and |
| 119 | assigning identities to each `Name` in the IR. Then the IR is fully scanned a |
| 120 | second time, and each `Reference` in the IR is resolved: all scopes visible to |
| 121 | the reference are scanned for the name, and the corresponding `CanonicalName` is |
| 122 | assigned to the reference. |
| 123 | |
| 124 | ### Validation |
| 125 | |
| 126 | *TODO(bolms): other validations?* |
| 127 | |
| 128 | #### Size Checking |
| 129 | |
| 130 | *TODO(bolms): describe* |
| 131 | |
| 132 | #### Overlap Checking |
| 133 | |
| 134 | *TODO(bolms): describe* |
| 135 | |
| 136 | ## Back End |
| 137 | |
| 138 | *Implemented in [back_end/...][back_end]* |
| 139 | |
| 140 | [back_end]: back_end/ |
| 141 | |
| 142 | Currently, only a C++ back end is implemented. |
| 143 | |
| 144 | A back end takes Emboss IR and produces code in a specific language for |
| 145 | manipulating the Emboss-defined data structures. |
| 146 | |
| 147 | ### C++ |
| 148 | |
| 149 | *Implemented in [header_generator.py][header_generator_py] with templates in |
| 150 | [generated_code_templates][generated_code_templates], support code in |
| 151 | [emboss_cpp_util.h][emboss_cpp_util_h], and a driver program in |
| 152 | [emboss_codegen_cpp.py][emboss_codegen_cpp_py]* |
| 153 | |
| 154 | [header_generator_py]: back_end/cpp/header_generator.py |
| 155 | [generated_code_templates]: back_end/cpp/generated_code_templates |
| 156 | [emboss_cpp_util_h]: back_end/cpp/emboss_cpp_util.h |
| 157 | [emboss_codegen_cpp_py]: back_end/cpp/emboss_codegen_cpp.py |
| 158 | |
| 159 | The C++ code generator is currently very minimal. `header_generator.py` |
| 160 | essentially inserts values from the IR into text templates. |
| 161 | |
| 162 | *TODO(bolms): add more documentation once the C++ back end has more features.* |