Design Sketch: Packed Field Notation

This document is provided for historical interest. This feature is now implemented in the form of the $next keyword.

Motivation

Many structures have many or most fields laid out consecutively, possibly with padding for alignment. For example:

struct Simple:
  0 [+2]  UInt  field_1
  2 [+4]  UInt  field_2
  6 [+2]  UInt  field_3

For simple structures of fixed-size fields, the main issue is unchecked redundancy: it is relatively easy to enter the wrong value for the field offset, and no compiler checks will help.

For more complex structures with multiple variable-sized fields, this can lead to unwieldy offsets:

struct Complex:
  0     [+2]  UInt    header_length (h)
  2     [+h]  Header  header
  2+h   [+2]  UInt    body_length (b)
  4+h   [+b]  Body    body
  4+h+b [+4]  UInt    crc

In both cases, there is some benefit to a shorthand notation that says 'this field should be placed immediately after the end of the lexically-previous field.

Example

(Exact syntax TBD.)

struct Complex:
  0     [+2]  UInt    header_length (h)
  $next [+h]  Header  header
  $next [+2]  UInt    body_length (b)
  $next [+b]  Body    body
  $next [+4]  UInt    crc

It is tempting to use some more specialized, terser syntax, like:

struct Complex:
  0  [+2]  UInt    header_length (h)
  ^^ [+h]  Header  header
  ^^ [+2]  UInt    body_length (b)
  ^^ [+b]  Body    body
  ^^ [+4]  UInt    crc

However, an explicit symbol has the advantage that you can use it in expressions, if needed:

struct ComplexWithGap:
  0       [+2]  UInt    header_length (h)
  $next   [+h]  Header  header
  $next   [+2]  UInt    body_length (b)
  # 2-byte reserved gap.
  $next+2 [+b]  Body    body
  $next   [+4]  UInt    crc

Or, with a (not-yet-implemented) $align() function:

struct ComplexWithAlignment:
  0                [+2]  UInt    header_length (h)
  $align($next, 4) [+h]  Header  header
  $align($next, 4) [+2]  UInt    body_length (b)
  $align($next, 4) [+b]  Body    body
  $align($next, 4) [+4]  UInt    crc

Implementation

Assuming the “new symbol” approach:

  1. Pick a new symbol name -- preferably, come up with a few alternatives and do a quick survey. By convention, Emboss built-in symbols start with $.
  2. Add the new name to LITERAL_TOKEN_PATTERNS in compiler/front_end/tokenizer.py.
  3. Add a new production for builtin-word -> "$new_symbol" to the _word() function in module_ir.py.
  4. Add a new compiler pass before synthetics.synthesize_fields, to replace the new symbol with the expanded representation. This should be relatively straightforward -- something that uses fast_traverse_ir_top_down() to find all ir_pb2.Structure elements in the IR, then iterates over the field offsets within each structure, and recursively replaces any ir_pb2.Expressions with a builtin_reference.canonical_name.object_path[0] equal to "$new_symbol". It would probably be useful to make traverse_ir._fast_traverse_proto_top_down() into a public function, so that you do not have to re-write Expression traversal.

For this change, the back end should not need any modifications.