Blame - doc/design.md - third_party/github/google/emboss

blob: afa3ea142a5375df08cdbe2b2600ff420aee2713 [file] [log] [blame] [view]

Ben Olmstead	c0d7784	2019-07-31 17:34:05 -0700	[diff] [blame]	1	# Design of the Emboss Tool
				2
				3	This document describes the internals of Emboss. End users do not need to read
				4	this document.
				5
				6	TODO(bolms): Update this doc to include the newer passes.
				7
				8	The Emboss compiler is divided into separate "front end" and "back end"
				9	programs. The front end parses Emboss files (`.emb` files) and produces a
				10	stable intermediate representation (IR), which is consumed by the back ends.
Ben Olmstead	90ee0e8	2019-09-23 19:04:47 -0700	[diff] [blame]	11	This IR is defined in [public/ir_pb2.py][ir_pb2_py].
Ben Olmstead	c0d7784	2019-07-31 17:34:05 -0700	[diff] [blame]	12
Ben Olmstead	b3df29b	2019-09-23 20:01:40 -0700	[diff] [blame]	13	[ir_pb2_py]: public/ir_pb2.py
Ben Olmstead	c0d7784	2019-07-31 17:34:05 -0700	[diff] [blame]	14
				15	The back ends read the IR and emit code to view and manipulate Emboss-defined
				16	data structures. Currently, only a C++ back-end exists.
				17
				18	*TODO(bolms): Split the symbol resolution and validation steps in a separate
				19	"middle" component, to allow external code generators to generate undecorated
				20	Emboss IR instead of Emboss source text?*
				21
				22	## Front End
				23
				24	Implemented in [front_end/...][front_end]
				25
				26	[front_end]: front_end/
				27
				28	The front end is responsible for reading in Emboss definitions and producing a
				29	normalized intermediate representation (IR). It is divided into several steps:
				30	roughly, parsing, import resolution, symbol resolution, and validation.
				31
				32	The front end is orchestrated by [glue.py][glue_py], which runs each front end
				33	component in the proper order to construct an IR suitable for consumption by the
				34	back end.
				35
				36	[glue_py]: front_end/glue.py
				37
				38	The actual driver program is [emboss_front_end.py][emboss_front_end_py], which
				39	just calls `glue.ParseEmbossFile` and prints the results.
				40
				41	[emboss_front_end_py]: front_end/emboss_front_end.py
				42
				43	### File Parsing
				44
				45	Per-file parsing consumes the text of a single Emboss module, and produces an
				46	"undecorated" IR for the module, containing only syntactic-level information
				47	from the module.
				48
				49	This "undecorated" IR is (almost) a subset of the final IR: later steps will add
				50	information and perform validation, but will rarely remove anything from the IR
				51	before it is emitted.
				52
				53	#### Tokenization
				54
				55	Implemented in [tokenizer.py][tokenizer_py]
				56
				57	[tokenizer_py]: front_end/tokenizer.py
				58
				59	The tokenizer is a fairly standard tokenizer, with Indent/Dedent insertion a la
				60	Python. It divides source text into `parse_types.Symbol` objects, suitable for
				61	feeding into the parser.
				62
				63	#### Syntax Tree Generation
				64
				65	Implemented in [lr1.py][lr1_py] and [parser_generator.py][parser_generator_py], with a façade in [structure_parser.py][structure_parser_py]
				66
				67	[lr1_py]: front_end/lr1.py
				68	[parser_generator_py]: front_end/parser_generator.py
				69	[structure_parser_py]: front_end/structure_parser.py
				70
				71	Emboss uses a pretty standard Shift-Reduce LR(1) parser. This is implemented in
				72	three parts in Emboss:
				73
				74	* A generic parser generator implementing the table generation algorithms from
				75	[Compilers: Principles, Techniques, & Tools][dragon_book] and the
				76	error-marking algorithm from *[Generating LR Syntax Error Messages from
				77	Examples][jeffery_2003]*.
				78	* An Emboss-specific parser builder which glues the Emboss tokenizer, grammar,
				79	and error examples to the parser generator, producing an Emboss parser.
				80	* The Emboss grammar, which is extracted from the file normalizer
				81	([module_ir.py][module_ir_py]).
				82
				83	[dragon_book]: http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811
				84	[jeffery_2003]: http://dl.acm.org/citation.cfm?id=937566
				85
				86	#### Normalization
				87
				88	Implemented in [module_ir.py][module_ir_py]
				89
				90	[module_ir_py]: front_end/module_ir.py
				91
				92	Once a parse tree has been generated, it is fed into a normalizer which
				93	recursively turns the raw syntax tree into a "first stage" intermediate
				94	representation (IR). The first stage IR serves to isolate later stages from
				95	minor changes in the grammar, but only contains information from a single file,
				96	and does not perform any semantic checking.
				97
				98	### Import Resolution
				99
				100	TODO(bolms): Implement imports.
				101
				102	After each file is parsed, any new imports it has are added to a work queue.
				103	Each file in the work queue is parsed, potentially adding more imports to the
				104	queue, until the queue is empty.
				105
				106	### Symbol Resolution
				107
				108	Implemented in [symbol_resolver.py][symbol_resolver_py]
				109
				110	[symbol_resolver_py]: front_end/symbol_resolver.py
				111
				112	Symbol resolution is the process of correlating names in the IR. At the end of
				113	symbol resolution, every named entity (type definition, field definition, enum
				114	name, etc.) has a `CanonicalName`, and every reference in the IR has a
				115	`Reference` to the entity to which it refers.
				116
				117	This assignment occurs in two passes. First, the full IR is scanned, generating
				118	scoped symbol tables (nested dictionaries of names to `CanonicalName`), and
				119	assigning identities to each `Name` in the IR. Then the IR is fully scanned a
				120	second time, and each `Reference` in the IR is resolved: all scopes visible to
				121	the reference are scanned for the name, and the corresponding `CanonicalName` is
				122	assigned to the reference.
				123
				124	### Validation
				125
				126	TODO(bolms): other validations?
				127
				128	#### Size Checking
				129
				130	TODO(bolms): describe
				131
				132	#### Overlap Checking
				133
				134	TODO(bolms): describe
				135
				136	## Back End
				137
				138	Implemented in [back_end/...][back_end]
				139
				140	[back_end]: back_end/
				141
				142	Currently, only a C++ back end is implemented.
				143
				144	A back end takes Emboss IR and produces code in a specific language for
				145	manipulating the Emboss-defined data structures.
				146
				147	### C++
				148
				149	*Implemented in [header_generator.py][header_generator_py] with templates in
				150	[generated_code_templates][generated_code_templates], support code in
				151	[emboss_cpp_util.h][emboss_cpp_util_h], and a driver program in
				152	[emboss_codegen_cpp.py][emboss_codegen_cpp_py]*
				153
				154	[header_generator_py]: back_end/cpp/header_generator.py
				155	[generated_code_templates]: back_end/cpp/generated_code_templates
				156	[emboss_cpp_util_h]: back_end/cpp/emboss_cpp_util.h
				157	[emboss_codegen_cpp_py]: back_end/cpp/emboss_codegen_cpp.py
				158
				159	The C++ code generator is currently very minimal. `header_generator.py`
				160	essentially inserts values from the IR into text templates.
				161
				162	TODO(bolms): add more documentation once the C++ back end has more features.