docs/design/editions/edition-zero-features.md - third_party/github/protocolbuffers/protobuf - Git at Google

 # Edition Zero Features

 **Authors:** [@mcy](https://github.com/mcy),
 [@zhangskz](https://github.com/zhangskz),
 [@mkruskal-google](https://github.com/mkruskal-google)

 **Approved:** 2022-07-22

 Feature flags, and their defaults, that we will introduce to define the
 converged semantics of Edition Zero.

 **NOTE:** This document is largely replaced by the topic,
 [Feature Settings for Editions](https://protobuf.dev/editions/features) (to be
 released soon).

 ## Overview

 *Edition Zero Features* defines the "first edition" of the brave new world of
 no-`syntax` Protobuf. This document defines the actual mechanics of the features
 (in the narrow sense of editions) we need to implement in protoc, as well as the
 chosen defaults.

 This document will require careful review from various stakeholders, because it
 is essentially defining a new Protobuf `syntax`, even if it isn't spelled that
 way. In particular, we need to ensure that there is a way to rewrite existing
 `proto2` and `proto3` files as `editions` files, and the behavior of "mixed
 syntax" messages, without any nasty surprises.

 Note that it is an explicit goal that it be possible to take an arbitrary
 proto2/proto3 file and convert it to editions without semantic changes, via
 appropriate application of features.

 ## Existing Non-Conformance

 We must keep in mind that the status quo is messy. Many languages have some
 areas where they currently diverge from the correct proto2/proto3 semantics. For
 edition zero, we must preserve these idiosyncratic behaviors, because that is
 the only way for a proto2/proto3 -> editions LSC to be a no-op.

 For example, in this document we define a feature `features.enum =
 {CLOSED,OPEN}`. But currently Go does not implement closed enum semantics for
 `syntax=proto2` as it should. This behavior is out of conformance, but we must
 preserve this out-of-conformance behavior for edition zero.

 In other words, defining features and their semantics is in scope for edition
 zero, but fixing code generators to perfectly match those semantics is
 explicitly out-of-scope.

 ## Glossary

 Because we need to speak of two proto syntaxes, `proto2` and `proto3`, that have
 disagreeing terminology in some places, we'll define the following terms to aid
 discussion. When a term appears in `code font`, it refers to the Protobuf
 language keyword.

 *   A **presence discipline** is a handling for the presence (or hasbit) of a
     field. Every field notionally has a hasbit: whether it has been explicitly
     set via the API or whether a record for it was present on deserialization.
     See
     [Application Note: Field Presence](https://protobuf.dev/programming-guides/field_presence)
     for more on this topic. The discipline specifies how this bit is surfaced to
     the user:
     *   **No presence** means that the API does not expose the hasbit. The
         default value for the field behaves somewhat like a special sentinel
         value, which is not serialized and not merged-from. The hasbit may still
         exist in the implementation (C++ accidentally leaks this via HasField,
         for example). Note that repeated fields sort-of behave like no presence
         fields.
     *   **Explicit presence** means that the API exposes the hasbit through a
         `has` method and a `Clear` method; default values are always serialized
         if the hasbit is set.
 *   A **closed enum** is an enum where parsing requires validating that a parsed
     `int32` representing a field of this type matches one of the known set of
     valid values.
 *   An **open enum** does not have this restriction, and is just an `int32`
     field with well-known values.

 For the purposes of this document, we will use the syntax described in *Features
 as Custom Options*, since it is the prevailing consensus among those working on
 editions, and allows us to have enum-typed features. The exact names for the
 features are a matter of bikeshedding.

 ## Proposed Converged Semantics

 There are two kinds of syntax behaviors we need to capture: those that are
 turned on by a keyword, like `required`, and those that are implicit, like open
 enums. The differences between proto2 and proto3 today are:

 *   Required. Proto2 has `required` but not `defaulted`; Proto3 has `defaulted`
     but not `required`. Proto3 also does not allow custom defaults on
     `defaulted` fields, and on message-typed fields, `defaulted` is a synonym
     for `optional`.
 *   Groups. Proto2 has groups, proto3 does not.
 *   Enums. In Proto2, enums are **closed**: messages that have an enum not in
     the known set are stored in the unknown field set. In Proto3, enums are
     **open**.
 *   String validation. Proto2 is a bit wobbly on whether strings must be UTF-8
     when serialized; Proto3 enforces this (sometimes).
 *   Extensions. Proto2 has extensions, while Proto3 does not (`Any` is the
     canonical workaround).

 We propose defining the following features as part of edition zero:

 ### features.field_presence

 This feature is enum-typed and controls the presence discipline of a singular
 field:

 *   `EXPLICIT` (default) - the field has *explicit presence* discipline. Any
     explicitly set value will be serialized onto the wire (even if it is the
     same as the default value).
 *   `IMPLICIT` - the field has *no presence* discipline. The default value is
     not serialized onto the wire (even if it is explicitly set).
 *   `LEGACY_REQUIRED` - the field is wire-required and API-optional. Setting
     this will require being in the `required` allowlist. Any explicitly set
     value will be serialized onto the wire (even if it is the same as the
     default value).

 The syntax for singular fields is a much debated question. After discussing the
 tradeoffs, we have chosen to *eliminate both the `optional` and `required`
 keywords, making them parse errors*. Singular fields are spelled as in proto3
 (no label), and will take on the presence discipline given by
 `features.:presence`. Migration will require deleting every instance of
 `optional` in proto files in google3, of which there are 385,236.

 It is important to observe that proto2 users are much likelier to care about
 presence than proto3 users, since the design of proto3 discourages thinking
 about presence as an interesting feature of protos, so arguably introducing
 proto2-style presence will not register on most users' mental radars. This is
 difficult to prove concretely.

 `IMPLICIT` fields behave much like proto3 implicit fields: they cannot have
 custom defaults and are ignored on submessage fields. Also, if it is an
 enum-typed field, that enum must be open (i.e., it is either defined in a
 `syntax = proto3;` file or it specifies `option features.enum = OPEN;`
 transitively).

 We also make some semantic changes:

 *   ~~`IMPLICIT``fields may have a custom default value, unlike in`proto3`.
     Whether an`IMPLICIT` field containing its default value is serialized
     becomes an implementation choice (implementations are encouraged to try to
     avoid serializing too much, though).~~
 *   `has_optional_keyword()` and `has_presence()` now check for `EXPLICIT`, and
     are effectively synonyms.
 *   `proto3_optional` is rejected as a parse error (use the feature instead).

 Migrating from proto2/3 involves deleting all `optional`/`required` labels and
 adding `IMPLICIT` and `LEGACY_REQUIURED` annotations where necessary.

 #### Alternatives

 *   For syntax:
     *   Require `optional`. This may confuse proto3 users who are used to
         `optional` not being a default they reach for. Will result in
         significant (trivial, but noisy) churn in proto3 files. The keyword is
         effectively line noise, since it does not indicate anything other than
         "this is a singular field".
     *   Invent a new label, like `singular`. This results in more churn but
         avoids breaking peoples' priors.
     *   Allow `optional` and no label to coexist in a file, which take on their
         original meanings unless overridden by `features.field_presence`. The
         fact that a top-level `features.field_presence = IMPLICIT` breaks the
         proto3 expectation that `optional` means `EXPLICIT` may be a source of
         confusion.
 *   `proto:allow_required`, which must be present for `required` to not be a
     syntax error.
 *   Allow `required`/`optional` and introduce `defaulted` as a real keyword. We
     will not have another easy chance to introduce such syntax (which we do,
     because `edition = ...` is a breaking change).
 *   Reject custom defaults for `IMPLICIT` fields. This is technically not really
     needed for converged semantics, but trying to remove the Proto3-ness from
     `IMPLICIT` fields seems useful for consistency.

 #### Future Work

 In the future, we can introduce something like `features.always_serialize` or a
 similar new enumerator (`ALWAYS_SERIALIZE`) to the `when_missing` enum, which
 makes `EXPLICIT_PRESENCE` fields unconditionally serialized, allowing
 `LEGACY_REQUIRED` fields to become `EXPLICIT_PRESENCE` in a future large-scale
 change. The details of such a migration are out-of-scope for this document.

 #### Migration Examples

 Given the following files:

 ```
 // foo.proto
 syntax = "proto2"

 message Foo {
   required int32 x = 1;
   optional int32 y = 2;
   repeated int32 z = 3;
 }

 // bar.proto
 syntax = "proto3"

 message Bar {
   int32 x = 1;
   optional int32 y = 2;
   repeated int32 z = 3;
 }
 ```

 post-editions, they will look like this:

 ```
 // foo.proto
 edition = "tbd"

 message Foo {
   int32 x = 1 [features.field_presence = LEGACY_REQUIRED];
   int32 y = 2;
   repeated int32 z = 3;
 }

 // bar.proto
 edition = "tbd"
 option features.field_presence = NO_PRESENCE;

 message Bar {
   int32 x = 1;
   int32 y = 2 [features.field_presence = EXPLICIT_PRESENCE];
   repeated int32 z = 3;
 }
 ```

 ### features.enum_type

 Enum types come in two distinct flavors: *closed* and *open*.

 *   *closed* enums will store enum values that are out of range in the unknown
     field set.
 *   *open* enums will parse out of range values into their fields directly.

     **NOTE:** Closed enums can cause confusion for parallel arrays (two repeated
     fields that expect to have index i refer to the same logical concept in both
     fields) because an unknown enum value from a parallel array will be placed
     in the unknown field set and the arrays will cease being parallel. Similarly
     parsing and serializing can change the order of a repeated closed enum by
     moving unknown values to the end.

     **NOTE:** Some runtimes (C++ and Java, in particular) currently do not use
     the declaration site of enums to determine whether an enum field is treated
     as open; rather, they use the syntax of the message the field is defined in,
     instead. To preserve this proto2 quirk until we can migrate users off of it,
     Java and C++ (and runtimes with the same quirk) will use the value of
     `features.enum` as set at the file level of messages (so, if a file sets
     `features.enum = CLOSED` at the file level, enum fields defined in it behave
     as if the enum was closed, regardless of declaration). IMPLICIT singular
     fields in Java and C++ ignore this and are always treated as open, because
     they used to only be possible to define in proto3 files, which can't use
     proto2 enums.

 In proto2, `enum` values are closed and no requirements are placed upon the
 first `enum` value. The first enum value will be used as the default value.

 In proto3, `enum` values are open and the first `enum` value must be zero. The
 first `enum` value is used as the default value, but that value is required to
 be zero.

 In edition zero, We will add a feature `features.enum_type = {CLOSED,OPEN}`. The
 default will be `OPEN`. Upgraded proto2 files will explicitly set
 `features.enum_type = CLOSED`. The requirement of having the first enum value be
 zero will be dropped.

 **NOTE:** Nominally this exposes a new state in the configuration space, OPEN
 enums with a non-zero default. We decided that excluding this option simply
 because it was previously inexpressible was a false economy.

 #### Alternatives

 *   We could add a property for requiring a zero first value for an enum. This
     feels needlessly complicated.
 *   We could drop the ability to have `CLOSED` enums, but that is a semantic
     change.

 #### Migration Examples

 Given the following files:

 ```
 // foo.proto
 syntax = "proto2"

 enum Foo {
   A = 2, B = 4, C = 6,
 }

 // bar.proto
 syntax = "proto3"

 enum Bar {
   A = 0, B = 1, C = 5,
 }
 ```

 post-editions, they will look like this:

 ```
 // foo.proto
 edition = "tbd"
 option features.enum_type = CLOSED;

 enum Foo {
   A = 2, B = 4, C = 6,
 }

 // bar.proto
 edition = "tbd"

 enum Bar {
   A = 0, B = 1, C = 5,
 }
 ```

 If we wanted to merge them into one file, it would look like this:

 ```
 // foo.proto
 edition = "tbd"

 enum Foo {
   option features.enum_type = CLOSED;
   A = 2, B = 4, C = 6,
 }


 enum Bar {
   A = 0, B = 1, C = 5,
 }
 ```

 ### features.repeated_field_encoding

 In proto3, the `repeated_field_encoding` attribute defaults to `PACKED`. In
 proto2, the `repeated_field_encoding` attribute defaults to `EXPANDED`. Users
 explicitly enabled packed fields 12.3k times, but only explicitly disable it 200
 times. Thus we can see a clear preference for `repeated_field_encoding = PACKED`
 emerge. This data matches best practices. As such, the default value for
 `features.repeated_field_encoding` will be `PACKED`.

 The existing `[packed = …]` syntax will be made an alias for setting the feature
 in edition zero. This alias will eventually be removed. Whether that removal
 happens during the initial large-scale change to enable edition zero or as a
 follow on will be decided at the time.

 In the long term, we would like to remove explicit usages of
 `features.repeated_field_encoding = EXPANDED`, but we would prefer to separate
 that large-scale change from the landing of edition zero. So we will explicitly
 set `features.repeated_field_encoding` to `EXPANDED` at the file level when we
 migrate proto2 files to edition zero.

 #### Alternatives

 *   Force everyone to use packed fields. This is a semantic change, which we're
     trying to avoid in edition zero.
 *   Don’t add `features.repeated_field_encoding` and instead specify `[packed =
     false]` when converting from proto2. This will be incredibly noisy,
     syntax-wise and diff-wise.

 #### Migration Examples

 Given the following files:

 ```
 // foo.proto
 syntax = "proto2"

 message Foo {
   repeated int32 x = 1;
   repeated int32 y = 2 [packed = true];
   repeated int32 z = 3;
 }

 // bar.proto
 syntax = "proto3"

 message Foo {
   repeated int32 x = 1;
   repeated int32 y = 2 [packed = false];
   repeated int32 z = 3;
 }
 ```

 post-editions, they will look like this:

 ```
 // foo.proto
 edition = "tbd"
 options features.repeated_field_encoding = EXPANDED;

 message Foo {
   repeated int32 x = 1;
   repeated int32 y = 2 [packed = true];
   repeated int32 z = 3;
 }


 // bar.proto
 edition = "tbd"

 message Foo {
   repeated int32 x = 1;
   repeated int32 y = 2 [packed = false];
   repeated int32 z = 3;
 }
 ```

 Note that post migration, we have not changed `packed` to
 `features.repeated_field_encoding = PACKED`, although we could choose to do so
 if the diff cost is not monumental. We prefer to defer to an LSC after editions
 are shipped, if possible.

 ### features.string_field_validation

 **WARNING:** UTF8 validation is actually messier than originally believed. This
 feature is being reconsidered in _Editions Zero Feature: utf8_validation_.

 This feature is a tristate:

 *   `MANDATORY` - this means that a runtime MUST verify UTF-8.
 *   `HINT` - this means that a runtime may refuse to parse invalid UTF-8, but it
     can also simply skip the check for performance in some build modes.
 *   `NONE` - this field behaves like a `bytes` field on the wire, but parsers
     may mangle the string in an unspecified way (for example, Java may insert
     spaces as replacement characters).

 The default will be `MANDATORY`.

 Long term, we would like to remove this feature and make all `string` fields
 `MANDATORY`.

 #### Alternatives

 *   Drop the UTF-8 requirements completely. This seems like it will create more
     problems than it will solve (e.g., random things relying on validation need
     to be fixed) and it will be a lot of work. This is also counter to the
     vision of string being a UTF-8 type, and bytes being its unchecked sibling.
 *   Make opt-in verification a hard requirement instead of a hint, so that users
     have a nice performance needle they can play with.

 #### Future Work

 In the infinite future, we would like to remove this feature and force all
 `string` fields to be UTF-8 validated. To do this, we need to recognize that
 what many callers want from their `string` fields is a `bytes` field with a
 `string`-like API. To ease the transition, we would add per-codegen backend
 features, like `java.bytes_as_string`, that give a `bytes` field a generated API
 resembling that of a `string` field (with caveats about replacement characters
 forced by the host language's string type).

 The migration would take `HINT` or `SKIP` `string` fields and convert them into
 `bytes` fields with the appropriate API modifiers, depending on which languages
 use that proto; C++-only protos, for example, are a no-op.

 There is an argument to be made for "I want a string type, and I explicitly want
 replacement U+FFFD characters if I get something that isn't UTF-8." It is
 unclear if this is something users want and we would need to investigate it
 before making a decision.

 ### features.json_format

 This feature is dual state in edition zero:

 *   `ALLOW` - this means that a runtime must allow JSON parsing and
     serialization. Checks will be applied at the proto level to make sure that
     there is a well-defined mapping to JSON.
 *   `LEGACY_BEST_EFFORT` - this means that a runtime will do the best it can to
     parse and serialize JSON. Certain protos will be allowed that can result in
     undefined behavior at runtime (e.g. many:1 or 1:many mappings).

 The default will be `ALLOW`, which maps the to the current proto3 behavior.
 `LEGACY_BEST_EFFORT` will be used for proto2 files that require it (e.g. they’ve
 set `deprecated_legacy_json_field_conflicts`)

 #### Alternatives

 *   Keep the proto2 behavior - this will regress proto3 files by removing
     validation for JSON mappings, and lead to *more* undefined runtime behavior
 *   Only use `ALLOW` - there are ~30 cases internally where protos have invalid
     JSON mappings and rely on unspecified (but luckily well defined) runtime
     behavior.

 #### Future Work

 Long term, we would like to either remove this feature entirely or add a
 `DISALLOW` option instead of `LEGACY_BEST_EFFORT`. This will more strictly
 enforce that protos without a valid JSON mapping *can’t* be serialized or parsed
 to JSON. `DISALLOW` will be enforced at the proto-language level, where no
 message marked `ALLOW` can contain any message/enum marked `DISALLOW` (e.g.
 through extensions or fields)

 #### Migration Examples

 ### Extensions are Always Allowed

 Extensions may be used on all messages. This lifts a restriction from proto3.

 Extensions do not play nicely with `TypeResolver`. This is actually fixable, but
 probably only worth it if someone complains.

 #### Alternatives

 *   Add `features.allow_extensions`, default true. This feels unnecessary since
     uttering `extend` and `extensions` is required to use extensions in the
     first place.

 ### features.message_encoding

 This feature defaults to `LENGTH_PREFIXED`. The `group` syntax does not exist
 under editions. Instead, message-typed fields that have
 `features.message_encoding = DELIMITED` set will be encoded as groups (wire type
 3/4) rather than byte blobs (wire type 2). This reflects the existing API
 (groups are funny message fields) and simplifies the parser.

 A `proto2` group field will be converted into a nested message type of the same
 name, and a singular submessage field that is `features.message_encoding =
 DELIMITED` with the message type's name in snake_case.

 This could be used in the future to switch new message fields to use group
 encoding, which suggested previously as an efficiency direction.

 #### Alternatives

 *   Allow groups in `editions` with no changes. `group` syntax is deprecated, so
     we may as well take the opportunity to knock it out.
 *   Add a sidecar allowlist like we do for `required`. This is mostly
     orthogonal.

 #### Migration Examples

 Given the following file

 ```
 // foo.proto
 syntax = "proto2"

 message Foo {
   group Bar = 1 {
     optional int32 x = 1;
     repeated int32 y = 2;
   }
 }
 ```

 post-editions, it will look like this:

 ```
 // foo.proto
 edition = "tbd"

 message Foo {
   message Bar {
     optional int32 x = 1;
     repeated int32 y = 2;
   }
   Bar bar = 1 [features.message_encoding = DELIMITED];
 }
 ```

 ## Proposed Features Message

 Putting together all of the above, we propose the following `Features` message,
 including retention and target rules associated with fields.

 ```
 message Features {
   enum FieldPresence {
     EXPLICIT = 0;
     IMPLICIT = 1;
     LEGACY_REQUIRED = 2;
   }
   optional FieldPresence field_presence = 1 [
       retention = RUNTIME,
       target = FILE,
       target = FIELD
   ];

   enum EnumType {
     OPEN = 0;
     CLOSED = 1;
   }
   optional EnumType enum = 2 [
       retention = RUNTIME,
       target = FILE,
       target = ENUM
   ];

   enum RepeatedFieldEncoding {
     PACKED = 0;
     UNPACKED = 1;
   }
   optional RepeatedFieldEncoding repeated_field_encoding = 3 [
       retention = RUNTIME,
       target = FILE,
       target = FIELD
   ];

   enum StringFieldValidation {
     MANDATORY = 0;
     HINT = 1;
     NONE = 2;
   }
   optional StringFieldValidation string_field_validation = 4 [
       retention = RUNTIME,
       target = FILE,
       target = FIELD
   ];

   enum MessageEncoding {
     LENGTH_PREFIXED = 0;
     DELIMITED = 1;
   }
   optional MessageEncoding message_encoding = 5 [
       retention = RUNTIME,
       target = FILE,
       target = FIELD
   ];

   extensions 1000;  // for features_cpp.proto
   extensions 1001;  // for features_java.proto
 }
 ```