| # Design Sketch: Protocol Buffers <=> Emboss Translation |
| |
| ## Overview |
| |
| There are many tools that operate on Protocol Buffer objects ("Protos"). |
| Providing a way to translate between Protos and Emboss structures would allow |
| those tools to be used without writing a tedious translation layer. |
| |
| |
| ## Defining an Equivalent Proto `message` |
| |
| For each Emboss `struct`, `bits`, `enum`, and primitive type, there would need |
| to be some equivalent Proto encoding -- likely a `message` for each `struct` or |
| `bits`, a Proto `enum` inside a `message` for each `enum` (see below), and a |
| Proto primitive type for each Emboss primitive type. |
| |
| There are two basic ways that the Proto definition could be generated: |
| |
| 1. Human-Authored `.proto` Definitions: |
| |
| This requires more human effort when trying to use Emboss structures as |
| Protos, likely approaching the level of effort to just hand-write a |
| translation layer. It *might* make it easier to use an existing Proto |
| definition. |
| |
| It would also require significantly more flexibility, and therefore more |
| complexity, in the Emboss compiler. |
| |
| 2. Emboss Generates a `.proto` File: |
| |
| This option is likely to create slightly "unnatural" Proto definitions (see |
| below for more details), but requires very little human effort to create a |
| translation to a Proto. |
| |
| Escape hatches for "partially hand-coded" translations should be |
| considered, even if they are not implemented in the first pass at Emboss |
| <=> Proto translation. |
| |
| Because a human always has the option to hand code their own translation, this |
| document will assume option 2: the Emboss compiler generates a Proto |
| definition. |
| |
| |
| ### Proto2 vs Proto3 |
| |
| The current state of Google Protocol Buffers is a bit messy, with both "version |
| 2" ("Proto2") and "version 3" ("Proto3") Protocol Buffers. Proto2 and Proto3 |
| can (mostly) freely interoperate -- Proto2 files can import and use messages |
| from Proto3 files and vice versa -- and both have long-term support guarantees |
| from Google. Differences between Proto2 and Proto3 are highlighted below. It |
| is not clear whether Emboss should generate Proto2, Proto3, or both (selected |
| via a flag or file-level property). |
| |
| |
| ### Primitive Types |
| |
| #### `Int`, `UInt` |
| |
| `Int` and `UInt` can map to Proto's `int32`, `int64`, `uint32`, and `uint64`. |
| Emboss integers whose width is not exactly 32 or 64 bits can be widened to the |
| next-larger Proto integer size. |
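
As a rough illustration of the widening rule, the compiler-side type selection
might look like the sketch below. The function name `proto_integer_type` and
its parameters are hypothetical, not part of any existing Emboss API.

```python
def proto_integer_type(signed: bool, bits: int) -> str:
    """Pick the narrowest Proto integer type that can hold an Emboss integer.

    `signed` is True for an Emboss `Int` and False for a `UInt`; `bits` is the
    width of the Emboss field.  (Hypothetical helper, for illustration only.)
    """
    if not 1 <= bits <= 64:
        raise ValueError(f"unsupported integer width: {bits}")
    # Widths up to 32 bits widen to a 32-bit Proto type; the rest widen to 64.
    width = 32 if bits <= 32 else 64
    return f"{'int' if signed else 'uint'}{width}"
```

Under this sketch, `proto_integer_type(False, 8)` selects `uint32` and
`proto_integer_type(True, 48)` selects `int64`.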
| |
| |
| #### `Float` |
| |
| A 32-bit `Float` maps to Proto's `float`, and a 64-bit `Float` maps to `double`. |
| |
| |
| #### `Flag` |
| |
| `Flag` maps to Proto's `bool`. |
| |
| |
| #### (Future) Emboss String/Blob Type |
| |
| A future Emboss string or blob type would translate to Proto's `string` or |
| `bytes`. It is likely that an Emboss "string" will be `bytes` in Proto, since |
| Emboss is unlikely to enforce UTF-8 compliance. |
| |
| Note that Proto (version 2 only?) C++ does not enforce UTF-8 compliance on |
| `string`, which can lead to crashes when the message is decoded in Python, |
| Java, or another language that properly enforces string encoding. |
| |
| |
| ### Arrays |
| |
| Unidimensional arrays map neatly to `repeated` Proto fields. |
| |
| Multidimensional arrays must be handled with a wrapper `message` at each |
| dimension after the first. |
| |
| Because of the way that Proto wire format works (see [Translation Between |
| Emboss View and Proto Wire Format](#between-emboss-view-and-proto-wire-format), |
| below), there is a slight technical advantage to wrapping the outermost array |
| in its own message. This makes the (Proto) API slightly more awkward, but not |
| too bad. With the wrapper: |
| |
| ```c++ |
| auto element = structure.array_field().v(2); |
| auto nested_element = structure.array_2d_field().v(2).v(1); |
| ``` |
| |
| versus, without the wrapper: |
| |
| ```c++ |
| auto element = structure.array_field(2); |
| auto nested_element = structure.array_2d_field(2).v(1); |
| ``` |
| |
| |
| ### Conditional Fields |
| |
| In Proto2, conditional fields map fairly well to the concept of "presence" for |
| fields. Proto2 allows non-present fields to be read -- returning the default |
| value for that field -- but this is not an issue for Emboss, which can easily |
| generate the appropriate <code>has_*field*()</code> calls. |
| |
| Proto3 does not track existence for primitive types the way that Proto2 does. |
| The "recommended" workaround is to use standardized wrapper types |
| (`google.protobuf.FloatValue`, `google.protobuf.Int32Value`, etc.), which |
| introduce an extra layer. There is a second workaround, related to the slightly |
| weird way that Proto handles `oneof`: if the primitive field is inside a |
| `oneof`, then it is *not* always present. A `oneof` may contain a single |
| member, so primitive-typed fields could be generated as something like: |
| |
| ``` |
| message Foo { |
|   oneof field_1_oneof { |
|     int32 field_1 = 1; |
|   } |
| } |
| ``` |
| |
| Note that in Emboss, changing a field from unconditionally present to |
| conditionally present is (usually) a backwards-compatible change. |
| |
| |
| ### (Future) Emboss Union Construct |
| |
| An Emboss union construct would be necessary to take advantage of runtime space |
| savings from using a Proto `oneof`. |
| |
| |
| ### `struct` and `bits` |
| |
| `struct` and `bits` map neatly to `message`, with few issues. |
| |
| |
| #### Anonymous `bits` |
| |
| Anonymous `bits` get "flattened" so that their fields appear to be part of their |
| enclosing structure. This should be handled reasonably well by treating |
| read-write virtual fields as members of the `message`, and by suppressing the |
| "private" fields, such as anonymous `bits`. |
| |
| |
| #### Proto Field IDs |
| |
| Proto requires each field to have a unique tag ID. We propose that, for fields |
| with a fixed start location, the start location + 1 is used for a default tag |
| ID: since a change to a field's start location would be a breaking change to the |
| Emboss definition, it should be reasonably stable. For fields with a variable |
| start location, virtual fields, or where the programmer wants a specific tag, |
| the attribute `[(proto) id]` can be used to specify the ID. |
| |
| The "+ 1" is required since `0` is not a valid Proto tag ID. |
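
A minimal sketch of the proposed default, assuming a hypothetical helper
`proto_tag_id` that receives the field's fixed start location (or `None` for
variable-offset and virtual fields) and the optional `[(proto) id]` value. The
range limits in the sketch are Proto's documented field-number limits.

```python
from typing import Optional

MAX_TAG_ID = (1 << 29) - 1               # largest field number Proto accepts
RESERVED_TAG_IDS = range(19000, 20000)   # reserved by the Proto implementation


def proto_tag_id(start_location: Optional[int],
                 explicit_id: Optional[int] = None) -> int:
    """Return the Proto tag ID for one Emboss field (hypothetical helper)."""
    if explicit_id is not None:
        # A `[(proto) id]` attribute always overrides the default.
        tag = explicit_id
    elif start_location is not None:
        # Default: start location + 1, because 0 is not a valid tag ID.
        tag = start_location + 1
    else:
        raise ValueError("no fixed start location; an explicit id is required")
    if not 1 <= tag <= MAX_TAG_ID or tag in RESERVED_TAG_IDS:
        raise ValueError(f"invalid Proto tag ID: {tag}")
    return tag
```

Two fields that share a start location would receive the same default ID under
this scheme, so at least one of them would presumably need an explicit
`[(proto) id]`.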
| |
| |
| ### `enum` |
| |
| The Emboss `enum` construct does not map cleanly to the Proto `enum` construct, |
| with different issues in Proto2 vs Proto3. |
| |
| Common to both, the names of Proto `enum` values are hoisted into the same |
| namespace as the `enum` itself (consistent with C's handling of `enum`), |
| which means that multiple `enum`s in the same context cannot hold the same value |
| name. This can be handled -- somewhat awkwardly -- by wrapping the `enum` in a |
| "namespace" `message`, like: |
| |
| ``` |
| message SomeEnum { |
|   enum SomeEnum { |
|     VALUE1 = 1; |
|     VALUE2 = 2; |
|   } |
| } |
| ``` |
| |
| Additionally, Proto `enum` values must fit in an `int32`, whereas Emboss `enum` |
| values may require up to a `uint64`. |
| |
| Proto2: In Proto2, `enum`s are closed: unknown values are ignored on message |
| parse, so `enum` fields can never have an unknown value at runtime. Emboss |
| `enum`s, much like C `enum`s, can hold unknown values. |
| |
| Proto3: In Proto3, `enum`s are open, like Emboss `enum`s, but every Proto3 |
| `enum` must have a first entry whose value is `0`. In order to avoid |
| compatibility issues, Emboss should emit a well-known name for the `0` value in |
| every case. There is a second issue in Proto3: there is no "has" bit for enum |
| fields, so conditional enum fields have to be wrapped in a `message`. |
| (TODO(bolms): are Proto3 `enum`s signed, unsigned, or either?) |
| |
| Thus, for Proto2, `enum`s would produce something like: |
| |
| ``` |
| message SomeEnum { |
|   enum SomeEnum { |
|     VALUE1 = 1; |
|     VALUE2 = 2; |
|   } |
|   // A oneof must be named; `value_oneof` is a placeholder name. |
|   oneof value_oneof { |
|     SomeEnum value = 1; |
|     int64 integer_value = 2; |
|   } |
| } |
| ``` |
| |
| which would be included in structures as: |
| |
| ``` |
| message SomeStruct { |
|   optional SomeEnum some_enum = 1;  // NOT SomeEnum.SomeEnum |
| } |
| ``` |
| |
| For Proto3, the situation ends up similar: |
| |
| ``` |
| message SomeEnum { |
|   enum SomeEnum { |
|     DEFAULT = 0; |
|     VALUE1 = 1; |
|     VALUE2 = 2; |
|   } |
|   SomeEnum value = 1; |
| } |
| |
| message SomeStruct { |
|   optional SomeEnum some_enum = 1;  // NOT SomeEnum.SomeEnum |
| } |
| ``` |
| |
| |
| #### `enum` Name Restrictions |
| |
| Proto enforces a (very slightly) stricter rule for the names of values within |
| an `enum` than Emboss does: they must not collide *even when translated to |
| CamelCase*. |
| |
| For example, Emboss allows: |
| |
| ``` |
| enum Foo: |
|   BAR_1_1 = 2 |
|   BAR_11 = 11 |
| ``` |
| |
| When translated to CamelCase, `BAR_1_1` and `BAR_11` both become `Bar11`, and |
| thus are not allowed to be part of the same `enum` in Proto. |
| |
| It may be sufficient to require `.emb` authors to update their `enum`s when |
| attempting to compile to Proto. |
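
If the compiler were to diagnose such collisions itself, the check could be as
small as the sketch below. The `camel_case` function is an assumed
approximation of Proto's conversion (split on underscores, capitalize each
part), and both function names are hypothetical.

```python
from collections import defaultdict


def camel_case(name: str) -> str:
    """Approximate Proto's CamelCase conversion (assumed rule, not exact)."""
    return "".join(part.capitalize() for part in name.split("_"))


def camel_case_collisions(names):
    """Map each colliding CamelCase name to the enum value names behind it."""
    groups = defaultdict(list)
    for name in names:
        groups[camel_case(name)].append(name)
    # Only groups with more than one member are collisions.
    return {camel: group for camel, group in groups.items() if len(group) > 1}
```

For the `Foo` example above, `camel_case_collisions(["BAR_1_1", "BAR_11"])`
reports both names under `Bar11`, which could be turned into a compile-time
error that asks the `.emb` author to rename one of the values.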
| |
| |
| ### Bookkeeping Fields |
| |
| Emboss structures often have "bookkeeping" fields that are either irrelevant to |
| typical Proto consumers, or that impose unusual restrictions on them. |
| |
| For example, fields which are used to calculate the offset of other fields are |
| generally not useful to Proto consumers: |
| |
| ``` |
| struct Foo: |
|   0 [+4]  UInt  header_length (h) |
|   h [+4]  UInt  first_body_message |
| ``` |
| |
| **These fields would still need to be set correctly when translating *from* |
| Proto to Emboss.** |
| |
| Some of the pain could likely be mitigated via a [default |
| values](#default_values.md) feature, when implemented. |
| |
| Field-length fields are somewhat trickier: |
| |
| ``` |
| struct Foo: |
|   0 [+4]  UInt      message_length (m) |
|   4 [+m]  UInt:8[]  message_bytes |
| ``` |
| |
| In Proto, `message_length` becomes an implicit part of `message_bytes`, since |
| `message_bytes` knows its own length. For simple cases, as above, we |
| can likely have the Emboss compiler "just figure it out" and fold |
| `message_length` into `message_bytes`. For more complex cases, we will |
| probably need to have explicit annotations (`[(proto) set_length_by: x = |
| some_expression]`), or just require applications using the Proto side to set |
| length fields correctly. |
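
For the simple case above, the folding could behave like the following sketch,
which stands in for generated translation code. It assumes a little-endian
`message_length` (the `.emb` snippet does not specify a byte order), and the
names `encode_foo` and `decode_foo` are hypothetical.

```python
import struct


def encode_foo(message_bytes: bytes) -> bytes:
    """Build the Emboss `Foo` layout from the Proto-side `message_bytes`.

    `message_length` is derived from the payload, so the Proto side never
    sets it.  Assumes a little-endian 32-bit length field.
    """
    return struct.pack("<I", len(message_bytes)) + message_bytes


def decode_foo(buffer: bytes) -> bytes:
    """Recover `message_bytes` from an Emboss `Foo` buffer."""
    (message_length,) = struct.unpack_from("<I", buffer, 0)
    if 4 + message_length > len(buffer):
        raise ValueError("message_length extends past the end of the buffer")
    return buffer[4:4 + message_length]
```

In this arrangement the generated Proto `message` would expose only
`message_bytes`; `message_length` exists solely on the Emboss side.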
| |
| A similar problem happens with "message type" fields: |
| |
| ``` |
| struct Foo: |
|   0 [+4]  MessageType  message_type (mt) |
|   if mt == MessageType.BAR: |
|     4 [+8]  Bar  bar |
|   if mt == MessageType.BAZ: |
|     4 [+16]  Baz  baz |
|   # ... |
| ``` |
| |
| This will probably be easier to handle with a `union` construct in Emboss. |
| Again, "complex" cases will probably have to be handled by application code. |
| |
| |
| ## Translation |
| |
| ### Between Emboss View and Proto In-Memory Format |
| |
| Translation should be relatively straightforward; when going from Emboss to |
| Proto, the problem is roughly equivalent to serializing a View to text, and for |
| Proto to Emboss it is roughly equivalent to deserializing a View from text. |
| |
| One minor difference is that the *deserialization* from Proto must occur in |
| dependency order, while serialization can happen in any order. In Emboss text |
| format, *serialization* happens in dependency order, and deserialization happens |
| in whatever order is specified in the text. |
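
One way to obtain a dependency order is a topological sort over the fields'
dependencies. The sketch below uses Python's standard `graphlib` module; the
`write_order` name and the shape of the dependency map are assumptions made
for illustration.

```python
from graphlib import TopologicalSorter


def write_order(dependencies):
    """Return field names ordered so prerequisites come before dependents.

    `dependencies` maps each field name to the set of fields that must be
    written before it (for example, the field that determines its offset).
    """
    return list(TopologicalSorter(dependencies).static_order())
```

For the `header_length` example in Bookkeeping Fields, the map would be
`{"header_length": set(), "first_body_message": {"header_length"}}`, and
`header_length` is placed first. A dependency cycle makes `static_order()`
raise `graphlib.CycleError`.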
| |
| As with deserialization from text, it is possible for the Proto message to |
| include untranslatable entries (e.g., an Emboss `Int:16` would be stored in a Proto |
| `int32`; a too-large value in the Proto `message` should be rejected). |
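
The range check itself is small. A sketch, with the hypothetical name
`fits_emboss_integer`:

```python
def fits_emboss_integer(value: int, signed: bool, bits: int) -> bool:
    """Return True if `value` is representable in an Emboss Int/UInt field.

    `signed` selects `Int` (two's complement) versus `UInt`; `bits` is the
    field width.
    """
    if signed:
        return -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    return 0 <= value < (1 << bits)
```

A translator would run a check like this on every narrowed field and reject
the whole `message` if any value fails it, for example a Proto `int32` holding
`40000` that is destined for an `Int:16`.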
| |
| |
| ### Between Emboss View and Proto Wire Format |
| |
| Since the Proto wire format is extremely stable and well documented, it would be |
| possible for Emboss to emit code to directly translate between Emboss structs |
| and Proto wire format. |
| |
| *Serialization* is relatively straightforward; except for arrays, the code |
| structure is almost identical to the text serialization code structure. |
| |
| *Deserialization* is problematic. First and foremost, Proto does not specify an |
| order in which the fields of a structure will be serialized, so it is entirely |
| possible for the Emboss view to see a dependent field before its prerequisite |
| (e.g., have a variable-offset field before the offset specifier field). |
| Secondly, Proto repeated fields aren't really "arrays"; on the wire, other |
| fields can appear *in between* elements of repeated fields. For Emboss, this |
| means that every array in the structure would have to maintain a cursor during |
| deserialization. |
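
The per-array cursor can be illustrated with a deliberately small decoder. The
sketch below handles only unpacked varint fields (wire type 0) and writes into
fixed-capacity lists that stand in for Emboss arrays; the function names and
the `capacities` parameter are hypothetical. Repeated scalars can also arrive
in packed form (wire type 2), which this sketch does not handle.

```python
def read_varint(data: bytes, pos: int):
    """Decode one base-128 varint at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7


def decode_repeated_varints(data: bytes, capacities: dict):
    """Fill one fixed-size array per field number, tracking a cursor for each.

    `capacities` maps field number -> array length.  Elements of different
    repeated fields may be interleaved in `data`.
    """
    arrays = {field: [0] * size for field, size in capacities.items()}
    cursors = {field: 0 for field in capacities}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field, wire_type = key >> 3, key & 0x7
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = read_varint(data, pos)
        if field not in arrays:
            continue  # unknown field: skip it
        if cursors[field] >= len(arrays[field]):
            raise ValueError(f"too many elements for field {field}")
        arrays[field][cursors[field]] = value
        cursors[field] += 1  # advance only this field's cursor
    return arrays, cursors
```

With the input bytes `08 05 10 07 08 06`, field 1 receives `5`, field 2
receives `7`, and then field 1 receives `6`: the second element of field 1
lands in the right slot only because its cursor survived the intervening
field.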
| |
| It *may* still be desirable to support serialization without trying to support |
| deserialization, or to support deserialization for a subset of structures, so |
| that we can send Protos to/from microcontrollers: this would be an alternative |
| to Nanopb for some cases. |
| |
| |
| ### Between Emboss View and [Nanopb](https://github.com/nanopb/nanopb) |
| |
| In order to translate between Emboss views and Protos on microcontrollers and |
| other limited-memory devices, it may make sense to generate Emboss <=> Nanopb |
| code. On top of the standard Proto generator, we would have to implement a |
| Nanopb options file generator, and translation code. |
| |
| |
| ## Miscellaneous Notes |
| |
| ### Overlays |
| |
| Emboss was designed with the notion that some backends would need their own |
| attributes -- for example, the `[(cpp) namespace]` attribute, and here there |
| are a number of `[(proto)]` attributes. |
| |
| However, adding back-end-specific attributes still requires changes to be made |
| directly to the `.emb` file, which may be inconvenient for `.emb`s from third |
| parties. |
| |
| Ideally, one could write an "overlay file," like: |
| |
| ``` |
| message Foo |
|   [(proto) attr = value] |
| |
|   field |
|     [(proto) field_attr = value] |
| ``` |
| |
| This is not needed for a first pass at a Proto back end, but should be |
| considered. |
| |
| |
| ### Generating an `.emb` From a `.proto` |
| |
| There are cases where it would be useful to generate a microcontroller-friendly |
| representation of an existing Proto, rather than the other way around. |
| |
| For most `message`s, it would be relatively straightforward to generate a |
| `struct`, like: |
| |
| ``` |
| message Foo { |
|   optional int32 bar = 1; |
|   optional bool baz = 2; |
|   optional string qux = 3; |
| } |
| ``` |
| |
| to: |
| |
| ``` |
| struct Foo: |
|   0 [+4]  bits: |
|     0 [+1]  Flag  has_bar |
|     1 [+1]  Flag  has_baz |
|     if has_baz: |
|       2 [+1]  Flag  baz |
|     3 [+1]  Flag  has_qux |
| |
|   if has_bar: |
|     4 [+4]  Int:32  bar |
| |
|   if has_qux: |
|     8 [+4]  UInt:32  qux_offset |
|     12 [+4]  UInt:32  qux_length |
|     qux_offset [+qux_length]  UInt:8[]  qux |
| ``` |
| |
| The main issue is that it would be difficult to maintain equivalent |
| backwards-compatibility guarantees to the ones that Proto provides as messages |
| evolve. |
| |
| Also note that this format is fairly close to the [Cap'n |
| Proto](https://capnproto.org/) format. |