| # Edition Zero Features |
| |
| **Authors:** [@mcy](https://github.com/mcy), |
| [@zhangskz](https://github.com/zhangskz), |
| [@mkruskal-google](https://github.com/mkruskal-google) |
| |
| **Approved:** 2022-07-22 |
| |
| Feature flags, and their defaults, that we will introduce to define the |
| converged semantics of Edition Zero. |
| |
| **NOTE:** This document is largely replaced by the topic, |
| [Feature Settings for Editions](https://protobuf.dev/editions/features) (to be |
| released soon). |
| |
| ## Overview |
| |
| *Edition Zero Features* defines the "first edition" of the brave new world of |
| no-`syntax` Protobuf. This document defines the actual mechanics of the features |
| (in the narrow sense of editions) we need to implement in protoc, as well as the |
| chosen defaults. |
| |
| This document will require careful review from various stakeholders, because it |
| is essentially defining a new Protobuf `syntax`, even if it isn't spelled that |
| way. In particular, we need to ensure that there is a way to rewrite existing |
| `proto2` and `proto3` files as `editions` files, and the behavior of "mixed |
| syntax" messages, without any nasty surprises. |
| |
| Note that it is an explicit goal that it be possible to take an arbitrary |
| proto2/proto3 file and convert it to editions without semantic changes, via |
| appropriate application of features. |
| |
| ## Existing Non-Conformance |
| |
| We must keep in mind that the status quo is messy. Many languages have some |
| areas where they currently diverge from the correct proto2/proto3 semantics. For |
| edition zero, we must preserve these idiosyncratic behaviors, because that is |
| the only way for a proto2/proto3 -> editions LSC to be a no-op. |
| |
| For example, in this document we define a feature `features.enum_type = |
| {CLOSED,OPEN}`. But Go currently does not implement closed enum semantics for |
| `syntax=proto2` as it should. This is out of conformance, but we must |
| preserve the behavior for edition zero. |
| |
| In other words, defining features and their semantics is in scope for edition |
| zero, but fixing code generators to perfectly match those semantics is |
| explicitly out-of-scope. |
| |
| ## Glossary |
| |
| Because we need to speak of two proto syntaxes, `proto2` and `proto3`, that have |
| disagreeing terminology in some places, we'll define the following terms to aid |
| discussion. When a term appears in `code font`, it refers to the Protobuf |
| language keyword. |
| |
| * A **presence discipline** is a handling for the presence (or hasbit) of a |
| field. Every field notionally has a hasbit: whether it has been explicitly |
| set via the API or whether a record for it was present on deserialization. |
| See |
| [Application Note: Field Presence](https://protobuf.dev/programming-guides/field_presence) |
| for more on this topic. The discipline specifies how this bit is surfaced to |
| the user: |
|     * **No presence** means that the API does not expose the hasbit. The |
|       default value for the field behaves somewhat like a special sentinel |
|       value, which is not serialized and not merged-from. The hasbit may still |
|       exist in the implementation (C++ accidentally leaks this via HasField, |
|       for example). Note that repeated fields sort-of behave like no presence |
|       fields. |
|     * **Explicit presence** means that the API exposes the hasbit through a |
|       `has` method and a `Clear` method; default values are always serialized |
|       if the hasbit is set. |
| * A **closed enum** is an enum where parsing requires validating that a parsed |
| `int32` representing a field of this type matches one of the known set of |
| valid values. |
| * An **open enum** does not have this restriction, and is just an `int32` |
| field with well-known values. |
| |
| For the purposes of this document, we will use the syntax described in *Features |
| as Custom Options*, since it is the prevailing consensus among those working on |
| editions, and allows us to have enum-typed features. The exact names for the |
| features are a matter of bikeshedding. |
| |
| ## Proposed Converged Semantics |
| |
| There are two kinds of syntax behaviors we need to capture: those that are |
| turned on by a keyword, like `required`, and those that are implicit, like open |
| enums. The differences between proto2 and proto3 today are: |
| |
| * Required. Proto2 has `required` but not `defaulted`; Proto3 has `defaulted` |
| but not `required`. Proto3 also does not allow custom defaults on |
| `defaulted` fields, and on message-typed fields, `defaulted` is a synonym |
| for `optional`. |
| * Groups. Proto2 has groups, proto3 does not. |
| * Enums. In Proto2, enums are **closed**: a parsed enum value that is not in |
| the known set is stored in the unknown field set. In Proto3, enums are |
| **open**. |
| * String validation. Proto2 is a bit wobbly on whether strings must be UTF-8 |
| when serialized; Proto3 enforces this (sometimes). |
| * Extensions. Proto2 has extensions, while Proto3 does not (`Any` is the |
| canonical workaround). |
| |
| We propose defining the following features as part of edition zero: |
| |
| ### features.field_presence |
| |
| This feature is enum-typed and controls the presence discipline of a singular |
| field: |
| |
| * `EXPLICIT` (default) - the field has *explicit presence* discipline. Any |
| explicitly set value will be serialized onto the wire (even if it is the |
| same as the default value). |
| * `IMPLICIT` - the field has *no presence* discipline. The default value is |
| not serialized onto the wire (even if it is explicitly set). |
| * `LEGACY_REQUIRED` - the field is wire-required and API-optional. Setting |
| this will require being in the `required` allowlist. Any explicitly set |
| value will be serialized onto the wire (even if it is the same as the |
| default value). |
| |
| The syntax for singular fields is a much debated question. After discussing the |
| tradeoffs, we have chosen to *eliminate both the `optional` and `required` |
| keywords, making them parse errors*. Singular fields are spelled as in proto3 |
| (no label), and will take on the presence discipline given by |
| `features.field_presence`. Migration will require deleting every instance of |
| `optional` in proto files in google3, of which there are 385,236. |
| |
| It is important to observe that proto2 users are much likelier to care about |
| presence than proto3 users, since the design of proto3 discourages thinking |
| about presence as an interesting feature of protos, so arguably introducing |
| proto2-style presence will not register on most users' mental radars. This is |
| difficult to prove concretely. |
| |
| `IMPLICIT` fields behave much like proto3 implicit fields: they cannot have |
| custom defaults, and the setting is ignored on submessage fields. Also, if an |
| `IMPLICIT` field is enum-typed, that enum must be open (i.e., it is either |
| defined in a `syntax = proto3;` file or it specifies |
| `option features.enum_type = OPEN;` transitively). |
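| |
| The restrictions above can be sketched as follows. This is an illustration |
| only: the file, message, enum, and field names are hypothetical, and the enum |
| option uses the `features.enum_type` spelling defined later in this document. |
| |
| ``` |
| // sketch.proto (hypothetical) |
| edition = "tbd"; |
| |
| enum Color { |
|   option features.enum_type = CLOSED; |
|   COLOR_UNKNOWN = 0; |
|   COLOR_RED = 1; |
| } |
| |
| message Sketch { |
|   // Accepted: scalar field, no custom default. |
|   int32 count = 1 [features.field_presence = IMPLICIT]; |
| |
|   // Rejected: IMPLICIT fields cannot have custom defaults. |
|   int32 limit = 2 [features.field_presence = IMPLICIT, default = 10]; |
| |
|   // Rejected: an IMPLICIT enum-typed field requires an open enum. |
|   Color color = 3 [features.field_presence = IMPLICIT]; |
| } |
| ``` |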
| |
| We also make some semantic changes: |
| |
| * ~~`IMPLICIT` fields may have a custom default value, unlike in `proto3`. |
| Whether an `IMPLICIT` field containing its default value is serialized |
| becomes an implementation choice (implementations are encouraged to try to |
| avoid serializing too much, though).~~ |
| * `has_optional_keyword()` and `has_presence()` now check for `EXPLICIT`, and |
| are effectively synonyms. |
| * `proto3_optional` is rejected as a parse error (use the feature instead). |
| |
| Migrating from proto2/3 involves deleting all `optional`/`required` labels and |
| adding `IMPLICIT` and `LEGACY_REQUIRED` annotations where necessary. |
| |
| #### Alternatives |
| |
| * For syntax: |
|     * Require `optional`. This may confuse proto3 users who are used to |
|       `optional` not being a default they reach for. Will result in |
|       significant (trivial, but noisy) churn in proto3 files. The keyword is |
|       effectively line noise, since it does not indicate anything other than |
|       "this is a singular field". |
|     * Invent a new label, like `singular`. This results in more churn but |
|       avoids breaking peoples' priors. |
|     * Allow `optional` and no label to coexist in a file, which take on their |
|       original meanings unless overridden by `features.field_presence`. The |
|       fact that a top-level `features.field_presence = IMPLICIT` breaks the |
|       proto3 expectation that `optional` means `EXPLICIT` may be a source of |
|       confusion. |
|     * `proto:allow_required`, which must be present for `required` to not be a |
|       syntax error. |
| * Allow `required`/`optional` and introduce `defaulted` as a real keyword. We |
| will not have another easy chance to introduce such syntax (which we do, |
| because `edition = ...` is a breaking change). |
| * Reject custom defaults for `IMPLICIT` fields. This is technically not really |
| needed for converged semantics, but trying to remove the Proto3-ness from |
| `IMPLICIT` fields seems useful for consistency. |
| |
| #### Future Work |
| |
| In the future, we can introduce something like `features.always_serialize` or a |
| similar new enumerator (`ALWAYS_SERIALIZE`) to the `when_missing` enum, which |
| makes `EXPLICIT` fields unconditionally serialized, allowing |
| `LEGACY_REQUIRED` fields to become `EXPLICIT` in a future large-scale |
| change. The details of such a migration are out-of-scope for this document. |
| |
| #### Migration Examples |
| |
| Given the following files: |
| |
| ``` |
| // foo.proto |
| syntax = "proto2"; |
| |
| message Foo { |
|   required int32 x = 1; |
|   optional int32 y = 2; |
|   repeated int32 z = 3; |
| } |
| |
| // bar.proto |
| syntax = "proto3"; |
| |
| message Bar { |
|   int32 x = 1; |
|   optional int32 y = 2; |
|   repeated int32 z = 3; |
| } |
| ``` |
| |
| post-editions, they will look like this: |
| |
| ``` |
| // foo.proto |
| edition = "tbd"; |
| |
| message Foo { |
|   int32 x = 1 [features.field_presence = LEGACY_REQUIRED]; |
|   int32 y = 2; |
|   repeated int32 z = 3; |
| } |
| |
| // bar.proto |
| edition = "tbd"; |
| // File-level default: singular fields have no presence, as in proto3. |
| option features.field_presence = IMPLICIT; |
| |
| message Bar { |
|   int32 x = 1; |
|   int32 y = 2 [features.field_presence = EXPLICIT]; |
|   repeated int32 z = 3; |
| } |
| ``` |
| |
| ### features.enum_type |
| |
| Enum types come in two distinct flavors: *closed* and *open*. |
| |
| * *closed* enums will store enum values that are out of range in the unknown |
| field set. |
| * *open* enums will parse out of range values into their fields directly. |
| |
| **NOTE:** Closed enums can cause confusion for parallel arrays (two repeated |
| fields that expect to have index i refer to the same logical concept in both |
| fields) because an unknown enum value from a parallel array will be placed |
| in the unknown field set and the arrays will cease being parallel. Similarly |
| parsing and serializing can change the order of a repeated closed enum by |
| moving unknown values to the end. |
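| |
| A minimal sketch of the parallel-array hazard described in this note. The |
| file, message, and enum names are hypothetical, and the comments describe the |
| behavior stated above rather than any particular runtime. |
| |
| ``` |
| // measurements.proto (hypothetical) |
| edition = "tbd"; |
| |
| enum Unit { |
|   option features.enum_type = CLOSED; |
|   UNIT_UNKNOWN = 0; |
|   UNIT_METER = 1; |
| } |
| |
| message Measurements { |
|   // Parallel arrays: values[i] is expressed in units[i]. |
|   repeated double values = 1; |
|   repeated Unit units = 2; |
| } |
| |
| // Suppose a newer writer sends three values with |
| // units = [UNIT_METER, <unknown value 7>, UNIT_METER]. |
| // A reader using this definition keeps only two entries in `units`, |
| // because the unknown 7 is moved to the unknown field set, while |
| // `values` still has three entries. The indices no longer line up. |
| ``` |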
| |
| **NOTE:** Some runtimes (C++ and Java, in particular) currently do not use |
| the declaration site of enums to determine whether an enum field is treated |
| as open; instead, they use the syntax of the message the field is defined in. |
| To preserve this proto2 quirk until we can migrate users off of it, |
| Java and C++ (and runtimes with the same quirk) will use the value of |
| `features.enum_type` as set at the file level of messages (so, if a file sets |
| `features.enum_type = CLOSED` at the file level, enum fields defined in it |
| behave as if the enum was closed, regardless of declaration). `IMPLICIT` |
| singular fields in Java and C++ ignore this and are always treated as open, |
| because they used to only be possible to define in proto3 files, which can't |
| use proto2 enums. |
| |
| In proto2, `enum` values are closed and no requirements are placed upon the |
| first `enum` value. The first enum value will be used as the default value. |
| |
| In proto3, `enum` values are open and the first `enum` value must be zero. The |
| first `enum` value is used as the default value, but that value is required to |
| be zero. |
| |
| In edition zero, we will add a feature `features.enum_type = {CLOSED,OPEN}`. The |
| default will be `OPEN`. Upgraded proto2 files will explicitly set |
| `features.enum_type = CLOSED`. The requirement of having the first enum value be |
| zero will be dropped. |
| |
| **NOTE:** Nominally this exposes a new state in the configuration space, OPEN |
| enums with a non-zero default. We decided that excluding this option simply |
| because it was previously inexpressible was a false economy. |
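| |
| As a sketch of that newly expressible state, under the rules proposed here |
| (the file and enum names are hypothetical): |
| |
| ``` |
| // priority.proto (hypothetical) |
| edition = "tbd"; |
| |
| // OPEN is the edition zero default, so no feature needs to be set. |
| enum Priority { |
|   PRIORITY_HIGH = 1;  // first value, hence the default; not zero |
|   PRIORITY_LOW = 2; |
| } |
| ``` |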
| |
| #### Alternatives |
| |
| * We could add a property for requiring a zero first value for an enum. This |
| feels needlessly complicated. |
| * We could drop the ability to have `CLOSED` enums, but that is a semantic |
| change. |
| |
| #### Migration Examples |
| |
| Given the following files: |
| |
| ``` |
| // foo.proto |
| syntax = "proto2"; |
| |
| enum Foo { |
|   A = 2; B = 4; C = 6; |
| } |
| |
| // bar.proto |
| syntax = "proto3"; |
| |
| enum Bar { |
|   A = 0; B = 1; C = 5; |
| } |
| ``` |
| |
| post-editions, they will look like this: |
| |
| ``` |
| // foo.proto |
| edition = "tbd"; |
| option features.enum_type = CLOSED; |
| |
| enum Foo { |
|   A = 2; B = 4; C = 6; |
| } |
| |
| // bar.proto |
| edition = "tbd"; |
| |
| enum Bar { |
|   A = 0; B = 1; C = 5; |
| } |
| ``` |
| |
| If we wanted to merge them into one file, it would look like this: |
| |
| ``` |
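| // foo.proto |
| edition = "tbd"; |
| |
| // NOTE: in a real file the value names of Foo and Bar would have to differ, |
| // because enum values share the scope that encloses their enum. |
| enum Foo { |
|   option features.enum_type = CLOSED; |
|   A = 2; B = 4; C = 6; |
| } |
| |
| enum Bar { |
|   A = 0; B = 1; C = 5; |
| } |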
| ``` |
| |
| ### features.repeated_field_encoding |
| |
| In proto3, the `repeated_field_encoding` attribute defaults to `PACKED`. In |
| proto2, the `repeated_field_encoding` attribute defaults to `EXPANDED`. Users |
| have explicitly enabled packed fields 12.3k times, but have explicitly disabled |
| them only 200 times. This shows a clear preference for |
| `repeated_field_encoding = PACKED`, which also matches best practices. As such, |
| the default value for `features.repeated_field_encoding` will be `PACKED`. |
| |
| The existing `[packed = …]` syntax will be made an alias for setting the feature |
| in edition zero. This alias will eventually be removed. Whether that removal |
| happens during the initial large-scale change to enable edition zero or as a |
| follow on will be decided at the time. |
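| |
| A sketch of how the alias and the feature would sit side by side in edition |
| zero. The message and field names are hypothetical; the two annotated fields |
| are meant to be equivalent under the alias rule described above. |
| |
| ``` |
| edition = "tbd"; |
| |
| message Samples { |
|   // Legacy spelling, kept as an alias in edition zero. |
|   repeated int32 a = 1 [packed = false]; |
| |
|   // Feature spelling with the same intended meaning. |
|   repeated int32 b = 2 [features.repeated_field_encoding = EXPANDED]; |
| |
|   // No annotation: takes the edition zero default, PACKED. |
|   repeated int32 c = 3; |
| } |
| ``` |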
| |
| In the long term, we would like to remove explicit usages of |
| `features.repeated_field_encoding = EXPANDED`, but we would prefer to separate |
| that large-scale change from the landing of edition zero. So we will explicitly |
| set `features.repeated_field_encoding` to `EXPANDED` at the file level when we |
| migrate proto2 files to edition zero. |
| |
| #### Alternatives |
| |
| * Force everyone to use packed fields. This is a semantic change, which we're |
| trying to avoid in edition zero. |
| * Don’t add `features.repeated_field_encoding` and instead specify `[packed = |
| false]` when converting from proto2. This will be incredibly noisy, |
| syntax-wise and diff-wise. |
| |
| #### Migration Examples |
| |
| Given the following files: |
| |
| ``` |
| // foo.proto |
| syntax = "proto2"; |
| |
| message Foo { |
|   repeated int32 x = 1; |
|   repeated int32 y = 2 [packed = true]; |
|   repeated int32 z = 3; |
| } |
| |
| // bar.proto |
| syntax = "proto3"; |
| |
| message Foo { |
|   repeated int32 x = 1; |
|   repeated int32 y = 2 [packed = false]; |
|   repeated int32 z = 3; |
| } |
| ``` |
| |
| post-editions, they will look like this: |
| |
| ``` |
| // foo.proto |
| edition = "tbd"; |
| option features.repeated_field_encoding = EXPANDED; |
| |
| message Foo { |
|   repeated int32 x = 1; |
|   repeated int32 y = 2 [packed = true]; |
|   repeated int32 z = 3; |
| } |
| |
| // bar.proto |
| edition = "tbd"; |
| |
| message Foo { |
|   repeated int32 x = 1; |
|   repeated int32 y = 2 [packed = false]; |
|   repeated int32 z = 3; |
| } |
| ``` |
| |
| Note that post migration, we have not changed `packed` to |
| `features.repeated_field_encoding = PACKED`, although we could choose to do so |
| if the diff cost is not monumental. We prefer to defer to an LSC after editions |
| are shipped, if possible. |
| |
| ### features.string_field_validation |
| |
| **WARNING:** UTF8 validation is actually messier than originally believed. This |
| feature is being reconsidered in _Editions Zero Feature: utf8_validation_. |
| |
| This feature is a tristate: |
| |
| * `MANDATORY` - this means that a runtime MUST verify UTF-8. |
| * `HINT` - this means that a runtime may refuse to parse invalid UTF-8, but it |
| can also simply skip the check for performance in some build modes. |
| * `NONE` - this field behaves like a `bytes` field on the wire, but parsers |
| may mangle the string in an unspecified way (for example, Java may insert |
| spaces as replacement characters). |
| |
| The default will be `MANDATORY`. |
| |
| Long term, we would like to remove this feature and make all `string` fields |
| `MANDATORY`. |
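| |
| A sketch of how the three states might be combined in one file, assuming the |
| feature keeps the shape described here (see the warning above). The file-level |
| and field-level placements follow the target rules in the proposed `Features` |
| message; the message and field names are hypothetical. |
| |
| ``` |
| edition = "tbd"; |
| option features.string_field_validation = HINT; |
| |
| message LogEntry { |
|   // Inherits HINT from the file level. |
|   string message = 1; |
| |
|   // Overrides the file-level setting for a single field. |
|   string user_name = 2 [features.string_field_validation = MANDATORY]; |
| |
|   // Behaves like `bytes` on the wire; contents are not verified. |
|   string raw_payload = 3 [features.string_field_validation = NONE]; |
| } |
| ``` |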
| |
| #### Alternatives |
| |
| * Drop the UTF-8 requirements completely. This seems like it will create more |
| problems than it will solve (e.g., random things relying on validation need |
| to be fixed) and it will be a lot of work. This is also counter to the |
| vision of string being a UTF-8 type, and bytes being its unchecked sibling. |
| * Make opt-in verification a hard requirement instead of a hint, so that users |
| have a nice performance needle they can play with. |
| |
| #### Future Work |
| |
| In the infinite future, we would like to remove this feature and force all |
| `string` fields to be UTF-8 validated. To do this, we need to recognize that |
| what many callers want from their `string` fields is a `bytes` field with a |
| `string`-like API. To ease the transition, we would add per-codegen backend |
| features, like `java.bytes_as_string`, that give a `bytes` field a generated API |
| resembling that of a `string` field (with caveats about replacement characters |
| forced by the host language's string type). |
| |
| The migration would take `HINT` or `NONE` `string` fields and convert them into |
| `bytes` fields with the appropriate API modifiers, depending on which languages |
| use that proto; C++-only protos, for example, are a no-op. |
| |
| There is an argument to be made for "I want a string type, and I explicitly want |
| replacement U+FFFD characters if I get something that isn't UTF-8." It is |
| unclear if this is something users want and we would need to investigate it |
| before making a decision. |
| |
| ### features.json_format |
| |
| This feature is dual state in edition zero: |
| |
| * `ALLOW` - this means that a runtime must allow JSON parsing and |
| serialization. Checks will be applied at the proto level to make sure that |
| there is a well-defined mapping to JSON. |
| * `LEGACY_BEST_EFFORT` - this means that a runtime will do the best it can to |
| parse and serialize JSON. Certain protos will be allowed that can result in |
| undefined behavior at runtime (e.g. many:1 or 1:many mappings). |
| |
| The default will be `ALLOW`, which maps to the current proto3 behavior. |
| `LEGACY_BEST_EFFORT` will be used for proto2 files that require it (e.g. they’ve |
| set `deprecated_legacy_json_field_conflicts`). |
| |
| #### Alternatives |
| |
| * Keep the proto2 behavior - this will regress proto3 files by removing |
| validation for JSON mappings, and lead to *more* undefined runtime behavior. |
| * Only use `ALLOW` - there are ~30 cases internally where protos have invalid |
| JSON mappings and rely on unspecified (but luckily well defined) runtime |
| behavior. |
| |
| #### Future Work |
| |
| Long term, we would like to either remove this feature entirely or add a |
| `DISALLOW` option instead of `LEGACY_BEST_EFFORT`. This will more strictly |
| enforce that protos without a valid JSON mapping *can’t* be serialized or parsed |
| to JSON. `DISALLOW` will be enforced at the proto-language level, where no |
| message marked `ALLOW` can contain any message/enum marked `DISALLOW` (e.g. |
| through extensions or fields). |
| |
| #### Migration Examples |
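| |
| A minimal sketch of the proto2 case described above, assuming that |
| `features.json_format` can be set at the message level (the target rules for |
| this feature are not specified in this document). The message and field names |
| are hypothetical. Given the following file: |
| |
| ``` |
| // foo.proto |
| syntax = "proto2"; |
| |
| message Foo { |
|   option deprecated_legacy_json_field_conflicts = true; |
|   optional int32 foo_bar = 1; |
|   optional int32 fooBar = 2;  // same JSON name as foo_bar |
| } |
| ``` |
| |
| post-editions, it might look like this: |
| |
| ``` |
| // foo.proto |
| edition = "tbd"; |
| |
| message Foo { |
|   option features.json_format = LEGACY_BEST_EFFORT; |
|   int32 foo_bar = 1; |
|   int32 fooBar = 2;  // same JSON name as foo_bar |
| } |
| ``` |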
| |
| ### Extensions are Always Allowed |
| |
| Extensions may be used on all messages. This lifts a restriction from proto3. |
| |
| Extensions do not play nicely with `TypeResolver`. This is actually fixable, but |
| probably only worth it if someone complains. |
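| |
| To illustrate the lifted restriction, the sketch below declares and uses an |
| extension range in an editions file without any feature or opt-in. The message |
| and field names are hypothetical. |
| |
| ``` |
| edition = "tbd"; |
| |
| message Base { |
|   int32 id = 1; |
|   extensions 100 to 199; |
| } |
| |
| extend Base { |
|   string note = 100; |
| } |
| ``` |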
| |
| #### Alternatives |
| |
| * Add `features.allow_extensions`, default true. This feels unnecessary since |
| uttering `extend` and `extensions` is required to use extensions in the |
| first place. |
| |
| ### features.message_encoding |
| |
| This feature defaults to `LENGTH_PREFIXED`. The `group` syntax does not exist |
| under editions. Instead, message-typed fields that have |
| `features.message_encoding = DELIMITED` set will be encoded as groups (wire type |
| 3/4) rather than byte blobs (wire type 2). This reflects the existing API |
| (groups are funny message fields) and simplifies the parser. |
| |
| A `proto2` group field will be converted into a nested message type of the same |
| name, and a singular submessage field that is `features.message_encoding = |
| DELIMITED` with the message type's name in snake_case. |
| |
| This could be used in the future to switch new message fields to use group |
| encoding, which has been suggested previously as an efficiency direction. |
| |
| #### Alternatives |
| |
| * Allow groups in `editions` with no changes. `group` syntax is deprecated, so |
| we may as well take the opportunity to knock it out. |
| * Add a sidecar allowlist like we do for `required`. This is mostly |
| orthogonal. |
| |
| #### Migration Examples |
| |
| Given the following file: |
| |
| ``` |
| // foo.proto |
| syntax = "proto2"; |
| |
| message Foo { |
|   group Bar = 1 { |
|     optional int32 x = 1; |
|     repeated int32 y = 2; |
|   } |
| } |
| ``` |
| |
| post-editions, it will look like this: |
| |
| ``` |
| // foo.proto |
| edition = "tbd"; |
| |
| message Foo { |
|   message Bar { |
|     // No `optional` label: it is a parse error under editions. |
|     int32 x = 1; |
|     repeated int32 y = 2; |
|   } |
|   Bar bar = 1 [features.message_encoding = DELIMITED]; |
| } |
| ``` |
| |
| ## Proposed Features Message |
| |
| Putting together all of the above, we propose the following `Features` message, |
| including retention and target rules associated with fields. |
| |
| ``` |
| message Features { |
|   enum FieldPresence { |
|     EXPLICIT = 0; |
|     IMPLICIT = 1; |
|     LEGACY_REQUIRED = 2; |
|   } |
|   optional FieldPresence field_presence = 1 [ |
|     retention = RUNTIME, |
|     target = FILE, |
|     target = FIELD |
|   ]; |
| |
|   enum EnumType { |
|     OPEN = 0; |
|     CLOSED = 1; |
|   } |
|   // Named enum_type to match `features.enum_type` as used above. |
|   optional EnumType enum_type = 2 [ |
|     retention = RUNTIME, |
|     target = FILE, |
|     target = ENUM |
|   ]; |
| |
|   enum RepeatedFieldEncoding { |
|     PACKED = 0; |
|     EXPANDED = 1; |
|   } |
|   optional RepeatedFieldEncoding repeated_field_encoding = 3 [ |
|     retention = RUNTIME, |
|     target = FILE, |
|     target = FIELD |
|   ]; |
| |
|   enum StringFieldValidation { |
|     MANDATORY = 0; |
|     HINT = 1; |
|     NONE = 2; |
|   } |
|   optional StringFieldValidation string_field_validation = 4 [ |
|     retention = RUNTIME, |
|     target = FILE, |
|     target = FIELD |
|   ]; |
| |
|   enum MessageEncoding { |
|     LENGTH_PREFIXED = 0; |
|     DELIMITED = 1; |
|   } |
|   optional MessageEncoding message_encoding = 5 [ |
|     retention = RUNTIME, |
|     target = FILE, |
|     target = FIELD |
|   ]; |
| |
|   extensions 1000;  // for features_cpp.proto |
|   extensions 1001;  // for features_java.proto |
| } |
| ``` |