# Design of the Emboss Tool

This document describes the internals of Emboss. End users do not need to read
this document.

*TODO(bolms): Update this doc to include the newer passes.*

The Emboss compiler is divided into separate "front end" and "back end"
programs. The front end parses Emboss files (`.emb` files) and produces a
stable intermediate representation (IR), which is consumed by the back ends.
This IR is defined in [public/ir_pb2.py][ir_pb2_py].

[ir_pb2_py]: public/ir_pb2.py

The back ends read the IR and emit code to view and manipulate Emboss-defined
data structures. Currently, only a C++ back end exists.

*TODO(bolms): Split the symbol resolution and validation steps into a separate
"middle" component, to allow external code generators to generate undecorated
Emboss IR instead of Emboss source text?*

## Front End

*Implemented in [front_end/...][front_end]*

[front_end]: front_end/

The front end is responsible for reading in Emboss definitions and producing a
normalized intermediate representation (IR). It is divided into several steps:
roughly, parsing, import resolution, symbol resolution, and validation.

The front end is orchestrated by [glue.py][glue_py], which runs each front end
component in the proper order to construct an IR suitable for consumption by the
back end.

[glue_py]: front_end/glue.py

The actual driver program is [emboss_front_end.py][emboss_front_end_py], which
just calls `glue.ParseEmbossFile` and prints the results.

[emboss_front_end_py]: front_end/emboss_front_end.py

### File Parsing

Per-file parsing consumes the text of a single Emboss module, and produces an
"undecorated" IR for the module, containing only syntactic-level information
from the module.

This "undecorated" IR is (almost) a subset of the final IR: later steps will add
information and perform validation, but will rarely remove anything from the IR
before it is emitted.

#### Tokenization

*Implemented in [tokenizer.py][tokenizer_py]*

[tokenizer_py]: front_end/tokenizer.py

The tokenizer is fairly standard, with Indent/Dedent insertion à la Python. It
divides source text into `parse_types.Symbol` objects, suitable for feeding
into the parser.

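The Python-style Indent/Dedent insertion can be illustrated with a minimal
sketch. This is not the code in `tokenizer.py`: the function name, the token
kind names, and the use of plain tuples in place of `parse_types.Symbol`
objects are assumptions made for illustration, and real tokens are far more
fine-grained than whitespace-separated words.

```python
def tokenize_indentation(lines):
    """Yields (kind, text) pairs, inserting Indent/Dedent tokens.

    Hypothetical helper: the token kinds and tuple shape are illustrative
    and do not match the real parse_types.Symbol objects.
    """
    indent_stack = [0]  # Widths of the currently open indentation levels.
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # Blank lines do not affect indentation.
        width = len(line) - len(line.lstrip(" "))
        if width > indent_stack[-1]:
            indent_stack.append(width)
            yield ("Indent", "")
        while width < indent_stack[-1]:
            indent_stack.pop()
            yield ("Dedent", "")
        if width != indent_stack[-1]:
            # The line dedented to a width that was never opened.
            raise ValueError("inconsistent dedent: %r" % line)
        for word in stripped.split():
            yield ("Word", word)
        yield ("Newline", "")
    # Close any indentation levels still open at end of input.
    while len(indent_stack) > 1:
        indent_stack.pop()
        yield ("Dedent", "")
```

The stack of open indentation widths is what lets a single dedented line emit
several `Dedent` tokens at once, so the parser sees balanced Indent/Dedent
pairs much like explicit braces.
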
#### Syntax Tree Generation

*Implemented in [lr1.py][lr1_py] and [parser_generator.py][parser_generator_py],
with a façade in [structure_parser.py][structure_parser_py]*

[lr1_py]: front_end/lr1.py
[parser_generator_py]: front_end/parser_generator.py
[structure_parser_py]: front_end/structure_parser.py

Emboss uses a fairly standard shift-reduce LR(1) parser, implemented in three
parts:

* A generic parser generator implementing the table generation algorithms from
  *[Compilers: Principles, Techniques, & Tools][dragon_book]* and the
  error-marking algorithm from *[Generating LR Syntax Error Messages from
  Examples][jeffery_2003]*.
* An Emboss-specific parser builder which glues the Emboss tokenizer, grammar,
  and error examples to the parser generator, producing an Emboss parser.
* The Emboss grammar, which is extracted from the file normalizer
  (*[module_ir.py][module_ir_py]*).

[dragon_book]: http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811
[jeffery_2003]: http://dl.acm.org/citation.cfm?id=937566

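For readers unfamiliar with shift-reduce parsing, the following sketch shows
the table-driven loop that such a parser runs. It is an illustration only: the
toy grammar (`S -> ( S ) | x`), the hand-written `ACTION` and `GOTO` tables,
and the `parse` function are all assumptions for this example. In Emboss the
tables are produced by the parser generator from the real grammar, and the
interfaces in `lr1.py` are not shown here.

```python
# Toy grammar (hypothetical, not the Emboss grammar):
#   production 1: S -> ( S )
#   production 2: S -> x
PRODUCTIONS = {1: ("S", 3), 2: ("S", 1)}  # number -> (left side, right length)

# Hand-built parse tables for the toy grammar. A real generator computes
# these from the grammar instead.
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}


def parse(tokens):
    """Returns the list of production numbers applied, in reduction order."""
    tokens = list(tokens) + ["$"]  # "$" marks end of input.
    stack = [0]  # Stack of parser states; state 0 is the start state.
    pos = 0
    reductions = []
    while True:
        action = ACTION.get((stack[-1], tokens[pos]))
        if action is None:
            raise SyntaxError("unexpected %r at position %d" % (tokens[pos], pos))
        kind, arg = action
        if kind == "shift":
            stack.append(arg)  # Consume the token and enter the new state.
            pos += 1
        elif kind == "reduce":
            lhs, rhs_length = PRODUCTIONS[arg]
            del stack[len(stack) - rhs_length:]  # Pop one state per symbol.
            stack.append(GOTO[(stack[-1], lhs)])
            reductions.append(arg)
        else:  # accept
            return reductions
```

With these tables, `parse("((x))")` returns `[2, 1, 1]`: the innermost `x` is
reduced first, then each enclosing pair of parentheses. A missing table entry
is where a syntax error is detected, which is the point at which an
error-marking scheme can attach a message to a parser state.
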
#### Normalization

*Implemented in [module_ir.py][module_ir_py]*

[module_ir_py]: front_end/module_ir.py

Once a parse tree has been generated, it is fed into a normalizer which
recursively turns the raw syntax tree into a "first stage" intermediate
representation (IR). The first stage IR serves to isolate later stages from
minor changes in the grammar, but only contains information from a single file,
and does not perform any semantic checking.

### Import Resolution

*TODO(bolms): Implement imports.*

After each file is parsed, any new imports it has are added to a work queue.
Each file in the work queue is parsed, potentially adding more imports to the
queue, until the queue is empty.

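The work queue described above could be sketched as follows. This is a
hypothetical outline rather than existing Emboss code: `parse_with_imports`
and the `parse_file` callable (assumed to return a module IR together with the
names of the files it imports) are invented for illustration.

```python
import collections


def parse_with_imports(root_name, parse_file):
    """Parses root_name and every file it transitively imports.

    parse_file is a hypothetical callable that takes a file name and
    returns a (module_ir, import_names) pair.
    """
    modules = {}
    work_queue = collections.deque([root_name])
    while work_queue:
        name = work_queue.popleft()
        if name in modules:
            continue  # Already parsed; this also guards against import cycles.
        module_ir, import_names = parse_file(name)
        modules[name] = module_ir
        # Only queue files that have not been parsed yet.
        work_queue.extend(n for n in import_names if n not in modules)
    return modules
```

Keying the result by file name means each file is parsed at most once, even
when several modules import it or when two modules import each other.
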
### Symbol Resolution

*Implemented in [symbol_resolver.py][symbol_resolver_py]*

[symbol_resolver_py]: front_end/symbol_resolver.py

Symbol resolution is the process of correlating names in the IR. At the end of
symbol resolution, every named entity (type definition, field definition, enum
name, etc.) has a `CanonicalName`, and every reference in the IR has a
`Reference` to the entity to which it refers.

This assignment occurs in two passes. First, the full IR is scanned, generating
scoped symbol tables (nested dictionaries mapping names to `CanonicalName`), and
assigning identities to each `Name` in the IR. Then the IR is fully scanned a
second time, and each `Reference` in the IR is resolved: all scopes visible to
the reference are scanned for the name, and the corresponding `CanonicalName` is
assigned to the reference.

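The two passes can be sketched on a simplified IR. Everything in this example
is an assumption made for illustration: the IR is modeled as nested
dictionaries with `name`, `children`, and `references` keys, canonical names
are plain tuples, and the function names are invented. The real IR types and
the code in `symbol_resolver.py` differ.

```python
def assign_canonical_names(node, path=(), table=None):
    """Pass 1: names every entity and builds nested symbol tables.

    Each table maps a name to (canonical_name, table_for_nested_names).
    """
    if table is None:
        table = {}
    for child in node.get("children", []):
        canonical = path + (child["name"],)
        child["canonical_name"] = canonical
        nested = {}
        table[child["name"]] = (canonical, nested)
        assign_canonical_names(child, canonical, nested)
    return table


def resolve_references(node, table, enclosing=()):
    """Pass 2: resolves references, searching the innermost scope first."""
    scopes = (table,) + enclosing
    node["resolved"] = [
        _lookup(name, scopes) for name in node.get("references", [])
    ]
    for child in node.get("children", []):
        _, nested = table[child["name"]]
        resolve_references(child, nested, scopes)


def _lookup(name, scopes):
    for scope in scopes:
        if name in scope:
            return scope[name][0]
    raise KeyError("unresolved reference: " + name)
```

For example, given a module containing an enum `Kind` and a structure `Packet`
whose `payload` field refers to `length`, running both passes resolves the
reference to `("Packet", "length")`, because the structure's own scope is
searched before the module scope. Separating the passes lets a reference point
to an entity that is defined later in the file.
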
### Validation

*TODO(bolms): other validations?*

#### Size Checking

*TODO(bolms): describe*

#### Overlap Checking

*TODO(bolms): describe*

## Back End

*Implemented in [back_end/...][back_end]*

[back_end]: back_end/

Currently, only a C++ back end is implemented.

A back end takes Emboss IR and produces code in a specific language for
manipulating the Emboss-defined data structures.

### C++

*Implemented in [header_generator.py][header_generator_py] with templates in
[generated_code_templates][generated_code_templates], support code in
[emboss_cpp_util.h][emboss_cpp_util_h], and a driver program in
[emboss_codegen_cpp.py][emboss_codegen_cpp_py]*

[header_generator_py]: back_end/cpp/header_generator.py
[generated_code_templates]: back_end/cpp/generated_code_templates
[emboss_cpp_util_h]: back_end/cpp/emboss_cpp_util.h
[emboss_codegen_cpp_py]: back_end/cpp/emboss_codegen_cpp.py

The C++ code generator is currently very minimal. `header_generator.py`
essentially inserts values from the IR into text templates.

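The general idea of inserting IR values into text templates can be sketched
with Python's standard `string.Template`. The template text, the dictionary
shape standing in for the IR, and the `generate_struct` function are all
hypothetical. The real templates live in `generated_code_templates`, and their
format is not reproduced here.

```python
import string

# Hypothetical template text, for illustration only.
_STRUCT_TEMPLATE = string.Template(
    "class ${name}View {\n"
    " public:\n"
    "${accessors}"
    "};\n"
)
_ACCESSOR_TEMPLATE = string.Template("  ${type} ${field}() const;\n")


def generate_struct(struct_ir):
    """Renders a C++ class declaration from a simplified, dict-based IR."""
    accessors = "".join(
        _ACCESSOR_TEMPLATE.substitute(type=field["type"], field=field["name"])
        for field in struct_ir["fields"]
    )
    return _STRUCT_TEMPLATE.substitute(
        name=struct_ir["name"], accessors=accessors
    )
```

Calling `generate_struct` with a structure named `Packet` that has a single
`uint32_t` field named `length` yields a `PacketView` class declaration with
one `length()` accessor. Because `substitute` raises `KeyError` for a missing
placeholder value, an incomplete IR fails loudly instead of producing broken
C++.
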
*TODO(bolms): add more documentation once the C++ back end has more features.*