This document describes the internals of Emboss. End users do not need to read this document.
TODO(bolms): Update this doc to include the newer passes.
The Emboss compiler is divided into separate “front end” and “back end” programs. The front end parses Emboss files (.emb files) and produces a stable intermediate representation (IR), which is consumed by the back ends. This IR is defined in public/ir_data.py.
The back ends read the IR and emit code to view and manipulate Emboss-defined data structures. Currently, only a C++ back end exists.
TODO(bolms): Split the symbol resolution and validation steps into a separate “middle” component, to allow external code generators to generate undecorated Emboss IR instead of Emboss source text?
Implemented in front_end/...
The front end is responsible for reading in Emboss definitions and producing a normalized intermediate representation (IR). It is divided into several steps: roughly, parsing, import resolution, symbol resolution, and validation.
The front end is orchestrated by glue.py, which runs each front end component in the proper order to construct an IR suitable for consumption by the back end.
The actual driver program is emboss_front_end.py, which just calls glue.ParseEmbossFile and prints the results.
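In outline, such a driver is only a thin wrapper. The sketch below assumes a hypothetical signature for glue.ParseEmbossFile; the real function's arguments and return values may differ:

```python
import sys

from front_end import glue  # illustrative import path


def main(argv):
    # Hypothetical call: the real ParseEmbossFile may take a file reader
    # rather than raw text, and may return debug information and structured
    # error objects instead of plain strings.
    with open(argv[1]) as f:
        ir, errors = glue.ParseEmbossFile(argv[1], f.read())
    for error in errors:
        print(error, file=sys.stderr)
    if errors:
        return 1
    print(ir)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```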
Per-file parsing consumes the text of a single Emboss module, and produces an “undecorated” IR for the module, containing only syntactic-level information from the module.
This “undecorated” IR is (almost) a subset of the final IR: later steps will add information and perform validation, but will rarely remove anything from the IR before it is emitted.
Implemented in tokenizer.py
The tokenizer is a fairly standard tokenizer, with Indent/Dedent insertion à la Python. It divides source text into parse_types.Symbol objects, suitable for feeding into the parser.
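The Indent/Dedent insertion follows the usual Python-style scheme: the tokenizer keeps a stack of indentation levels and emits synthetic tokens whenever a line's indentation increases or decreases. A simplified sketch of the general technique (not the actual tokenizer.py code, which produces parse_types.Symbol objects and handles errors):

```python
def insert_indent_tokens(lines):
    """Yields ("INDENT",), ("DEDENT",), and ("LINE", text) pseudo-tokens."""
    indent_stack = [0]
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped:                  # blank lines do not affect indentation
            continue
        indent = len(line) - len(stripped)
        if indent > indent_stack[-1]:     # deeper: open one new block
            indent_stack.append(indent)
            yield ("INDENT",)
        while indent < indent_stack[-1]:  # shallower: close enclosing blocks
            indent_stack.pop()
            yield ("DEDENT",)
        yield ("LINE", stripped)
    while len(indent_stack) > 1:          # close any blocks still open at EOF
        indent_stack.pop()
        yield ("DEDENT",)


# list(insert_indent_tokens(["a:", "  b", "c"])) ==
#     [("LINE", "a:"), ("INDENT",), ("LINE", "b"), ("DEDENT",), ("LINE", "c")]
```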
Implemented in lr1.py and parser_generator.py, with a façade in structure_parser.py
Emboss uses a pretty standard shift-reduce LR(1) parser, implemented in three parts in Emboss: a parser generator that builds the LR(1) tables from the grammar, a shift-reduce engine that drives those tables over the token stream, and the structure_parser.py façade that the rest of the front end calls.
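As a rough illustration of the shift-reduce half, a schematic LR(1) parse loop driven by precomputed ACTION/GOTO tables looks something like the following. This is a generic sketch, not the actual lr1.py code; it omits error reporting and source-location tracking:

```python
def shift_reduce_parse(tokens, action, goto, productions):
    """Parses a token sequence using precomputed LR(1) ACTION/GOTO tables.

    Assumptions made for this sketch: each token has a .symbol attribute, the
    sequence ends with an end-of-input token, action[state, symbol] is one of
    ("shift", next_state), ("reduce", production_index), or ("accept",), and
    productions[i] is (lhs_symbol, rhs_length).
    """
    stack = [0]           # stack of parser states
    trees = []            # parallel stack of parse subtrees
    tokens = iter(tokens)
    token = next(tokens)
    while True:
        act = action[stack[-1], token.symbol]
        if act[0] == "shift":
            stack.append(act[1])
            trees.append(token)
            token = next(tokens)
        elif act[0] == "reduce":
            lhs, rhs_len = productions[act[1]]
            children = trees[len(trees) - rhs_len:]  # popped subtrees
            del stack[len(stack) - rhs_len:]
            del trees[len(trees) - rhs_len:]
            trees.append((lhs, children))            # new nonterminal node
            stack.append(goto[stack[-1], lhs])
        else:                                        # ("accept",)
            return trees[-1]
```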
Implemented in module_ir.py
Once a parse tree has been generated, it is fed into a normalizer which recursively turns the raw syntax tree into a “first stage” intermediate representation (IR). The first stage IR serves to isolate later stages from minor changes in the grammar, but only contains information from a single file, and does not perform any semantic checking.
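Conceptually, the normalization is a recursive walk over the parse tree that dispatches on each node's production and builds an IR object for it. The code below is a hypothetical, much-simplified version of that pattern; the real module_ir.py builds ir_data structures, not dicts:

```python
import collections

# Hypothetical parse-tree node and token types, standing in for the real
# parser and tokenizer output types.
Node = collections.namedtuple("Node", ["production", "children"])
Token = collections.namedtuple("Token", ["symbol", "text"])

# Hypothetical table mapping production names to IR constructors.
_HANDLERS = {
    "field": lambda offset, type_name, name: {
        "kind": "field", "offset": offset, "type": type_name, "name": name},
}


def build_ir(node):
    """Recursively converts a parse tree into a first-stage IR structure."""
    if isinstance(node, Token):       # leaf: keep the token's text
        return node.text
    children = [build_ir(child) for child in node.children]
    handler = _HANDLERS.get(node.production)
    return handler(*children) if handler else children


tree = Node("field", [Token("NUMBER", "0"), Token("NAME", "UInt"),
                      Token("NAME", "foo")])
print(build_ir(tree))
# {'kind': 'field', 'offset': '0', 'type': 'UInt', 'name': 'foo'}
```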
TODO(bolms): Implement imports.
After each file is parsed, any new imports it has are added to a work queue. Each file in the work queue is parsed, potentially adding more imports to the queue, until the queue is empty.
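In sketch form, this is a standard work-queue traversal that parses each file at most once; parse_module and list_imports below are hypothetical stand-ins for the per-file parsing step and import extraction:

```python
def parse_with_imports(root_path, parse_module, list_imports):
    """Parses root_path plus everything it transitively imports.

    Returns a dict mapping file path to that file's module IR.
    """
    queue = [root_path]
    parsed = {}                  # path -> module IR; doubles as the "seen" set
    while queue:
        path = queue.pop()
        if path in parsed:
            continue             # already parsed; never parse a file twice
        module_ir = parse_module(path)
        parsed[path] = module_ir
        for imported_path in list_imports(module_ir):
            if imported_path not in parsed:
                queue.append(imported_path)
    return parsed
```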
Implemented in symbol_resolver.py
Symbol resolution is the process of correlating names in the IR. At the end of symbol resolution, every named entity (type definition, field definition, enum name, etc.) has a CanonicalName, and every reference in the IR has a Reference to the entity to which it refers.
This assignment occurs in two passes. First, the full IR is scanned, generating scoped symbol tables (nested dictionaries of names to CanonicalNames) and assigning identities to each Name in the IR. Then the IR is fully scanned a second time, and each Reference in the IR is resolved: all scopes visible to the reference are scanned for the name, and the corresponding CanonicalName is assigned to the reference.
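A schematic of the two passes, using tuples and plain dicts in place of the real IR, CanonicalName, and Reference structures (the actual IR traversal is left out; the input here is an already-flattened list of definitions):

```python
def build_symbol_tables(definitions):
    """Pass 1: build per-scope symbol tables mapping names to canonical names.

    `definitions` is an iterable of (scope, name) pairs; a canonical name is
    modeled here as a tuple of names from the outermost scope inward.
    """
    tables = {}
    for scope, name in definitions:
        canonical_name = scope + (name,)
        tables.setdefault(scope, {})[name] = canonical_name
    return tables


def resolve_reference(tables, scope, name):
    """Pass 2 (for one reference): search visible scopes, innermost first."""
    for depth in range(len(scope), -1, -1):
        visible = tables.get(scope[:depth], {})
        if name in visible:
            return visible[name]
    raise LookupError("unresolved name: " + name)


# Example: a struct Foo containing a field bar, and a reference to bar made
# from inside Foo.
tables = build_symbol_tables([((), "Foo"), (("Foo",), "bar")])
assert resolve_reference(tables, ("Foo",), "bar") == ("Foo", "bar")
```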
TODO(bolms): other validations?
TODO(bolms): describe
TODO(bolms): describe
Implemented in back_end/...
Currently, only a C++ back end is implemented.
A back end takes Emboss IR and produces code in a specific language for manipulating the Emboss-defined data structures.
Implemented in header_generator.py with templates in generated_code_templates, support code in emboss_cpp_util.h, and a driver program in emboss_codegen_cpp.py
The C++ code generator is currently very minimal: header_generator.py essentially inserts values from the IR into text templates.
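The basic pattern is plain text substitution; the toy version below shows the shape of the approach, with a made-up template standing in for the real ones in generated_code_templates:

```python
import string

# Toy template; the real templates generate full view classes with field
# accessors, size calculations, and so on.
_STRUCT_TEMPLATE = string.Template(
    "class ${name}View {\n"
    " public:\n"
    "  // ... accessors generated from the IR would go here ...\n"
    "};\n")


def generate_struct_view(struct_ir):
    """Fills the template with values pulled from one struct's IR.

    `struct_ir` is modeled here as a dict with a "name" key; the real
    generator reads the ir_data structures produced by the front end.
    """
    return _STRUCT_TEMPLATE.substitute(name=struct_ir["name"])


print(generate_struct_view({"name": "Foo"}))
```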
TODO(bolms): add more documentation once the C++ back end has more features.