
 # Editions: Feature Extension Layout

 **Author:** [@mkruskal-google](https://github.com/mkruskal-google),
 [@zhangskz](https://github.com/zhangskz)

 **Approved:** 2023-08-23

 ## Background

 "[What are Protobuf Editions](what-are-protobuf-editions.md)" lays out a plan
 for allowing more targeted features not owned by the protobuf team. It uses
 extensions of the global features proto to implement this. One thing that was
 left ambiguous was *who* should own these extensions. Language, code
 generator, and runtime implementation are all similar but not identical
 distinctions.

 "Editions Zero Feature: utf8_validation" (not available externally, though a
 later version,
 "[Editions Zero: utf8_validation Without Problematic Options](editions-zero-utf8_validation.md)"
 is) is a recent plan to add a new set of generator features for utf8 validation.
 While the sole feature we had originally created (`legacy_closed_enum` in Java
 and C++) didn't have any ambiguity here, this one did. Specifically in Python,
 the current behaviors across proto2/proto3 are distinct for all 3
 implementations: pure Python, Python/C++, and Python/upb.

 ## Overview

 In meetings, we've discussed various alternatives, captured below. The original
 plan was to make feature extensions runtime implementation-specific (e.g. C++,
 Java, Python, upb). Some notable complications came up, though:

 1.  **Polyglot** - it's not clear how upb or C++ runtimes should behave in
     multi-language situations. Which feature sets do they consider for runtime
     behaviors? *Note: this is already a serious issue today, where all proto2
     strings and many proto3 strings are completely unsafe across languages.*

 2.  **Shared Implementations** - Runtimes like upb and C++ are used as backing
     implementations of multiple other languages (e.g. Python, Rust, Ruby, PHP).
     If we have a single set of `upb` or `cpp` features, migrating to those
     shared implementations would be more difficult (since there's no independent
     switches per-language). *Note: this is already the situation we're in today,
     where switching the runtime implementation can cause subtle and dangerous
     behavior changes.*

 Given that we only have two behaviors, and one of them is unambiguous, it seems
 reasonable to punt on this decision until we have more information. We may
 encounter more edge cases that require feature extensions (and give us more
 information) during the rollout of edition zero. We also have a lot of freedom
 to re-model features in later editions, so keeping the initial implementation as
 simple as possible seems best (i.e. Alternative 2).

 ## Alternatives

 ### Alternative 1: Runtime Implementation Features

 Features would be per-runtime implementation as originally described in
 "Editions Zero Feature: utf8_validation." For example, Protobuf Python users
 would set different features depending on the backing implementation (e.g.
 `features.(pb.cpp).<feature>`, `features.(pb.upb).<feature>`).
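
 As a rough sketch of what this layout might look like in a single `.proto`
 file: the feature messages, the `utf8_validation` feature name, its enum
 values, and the extension numbers below are all assumptions made for
 illustration, not the actual feature definitions. Real feature fields also
 carry metadata such as `targets` and `edition_defaults`, which is omitted here.

 ```proto
 edition = "2023";

 package pb;

 import "google/protobuf/descriptor.proto";

 // Hypothetical enum; the value names are assumptions for this sketch.
 enum Utf8Validation {
   UTF8_VALIDATION_UNKNOWN = 0;
   NONE = 1;
   HINT = 2;
   VERIFY = 3;
 }

 // One feature message per runtime implementation.
 message CppFeatures {
   Utf8Validation utf8_validation = 1;
 }

 message UpbFeatures {
   Utf8Validation utf8_validation = 1;
 }

 // Each runtime extends the global FeatureSet directly.
 // The extension numbers are placeholders.
 extend google.protobuf.FeatureSet {
   CppFeatures cpp = 1000;
   UpbFeatures upb = 9995;
 }

 message Foo {
   // A Python user has to know the backing implementation to pick the
   // right feature: Python/C++ would read (pb.cpp), Python/upb (pb.upb).
   string name = 1 [
     features.(pb.cpp).utf8_validation = HINT,
     features.(pb.upb).utf8_validation = VERIFY
   ];
 }
 ```

 Note that nothing in this layout distinguishes Python/C++ from plain C++: both
 would read the same `(pb.cpp)` setting.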

 #### Pros

 *   Most consistent with range of behaviors expressible pre-Editions

 #### Cons

 *   The backing implementation may not (and arguably should not) be obvious to
     users.
 *   Lack of levers specifically for language / implementation combos. For
     example, there is no way to set Python-C++ behavior independently of C++
     behavior, which may make migration from other Python implementations
     harder.

 ### Alternative 2: Generator Features

 Features would be per-generator only (i.e. each protoc plugin would own one set
 of features). This was the second decision we made in later discussions, and
 while very similar to the above alternative, it's more in line with our goal of
 making features primarily for codegen.

 For example, all Python implementations would share the same set of features
 (e.g. `features.(pb.python).<feature>`). However, certain features could be
 targeted to specific implementations (e.g.
 `features.(pb.python).upb_utf8_validation` would only be used by Python/upb).
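
 A minimal sketch of this layout, under the same caveats as any illustration
 here: only `upb_utf8_validation` on `pb.python` comes from the text above. The
 `pb.php` extension, the enum and its values, and the extension numbers are
 hypothetical, and feature metadata such as `targets` and `edition_defaults` is
 omitted.

 ```proto
 edition = "2023";

 package pb;

 import "google/protobuf/descriptor.proto";

 // Hypothetical enum; the value names are assumptions for this sketch.
 enum Utf8Validation {
   UTF8_VALIDATION_UNKNOWN = 0;
   NONE = 1;
   HINT = 2;
   VERIFY = 3;
 }

 // One feature message per generator (protoc plugin).
 message PythonFeatures {
   // Would only be consulted by the Python/upb implementation.
   Utf8Validation upb_utf8_validation = 1;
 }

 // Hypothetical second generator that is also backed by upb.
 message PhpFeatures {
   Utf8Validation upb_utf8_validation = 1;
 }

 // The extension numbers are placeholders.
 extend google.protobuf.FeatureSet {
   PythonFeatures python = 9995;
   PhpFeatures php = 9996;
 }

 message Foo {
   // Python/upb and PHP/upb are configured independently, even though both
   // share upb as the backing runtime.
   string name = 1 [
     features.(pb.python).upb_utf8_validation = VERIFY,
     features.(pb.php).upb_utf8_validation = NONE
   ];
 }
 ```

 The two settings on `name` can disagree, so upb would need to know which
 language it is serving in order to pick one.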

 #### Pros

 *   Allows independent controls of shared implementations in different target
     languages (e.g. Python's upb feature won't affect PHP).

 #### Cons

 *   Possible complexity in upb to understand which language's features to
     respect. UPB is not currently aware of what language it is being used for.
 *   Limits in-process sharing across languages with shared implementations (e.g.
     Python upb, PHP upb) in the case of conflicting behaviors.
     *   Additional checks may be needed.

 ### Alternative 3: Migrate to bytes

 Since this whole discussion revolves around the utf8 validation feature, one
 option would be to just remove it from edition zero. Instead of adding a new
 toggle for UTF8 behavior, we could simply migrate everyone who doesn't enforce
 utf8 today to `bytes`. This would likely need another new *codegen* feature for
 generating byte getters/setters as strings, but that wouldn't have any of the
 ambiguity we're seeing today.

 Unfortunately, this doesn't seem feasible because of all the different behaviors
 laid out in "Editions Zero Feature: utf8_validation." UTF8 validation isn't
 really a binary on/off decision, and it can vary widely between languages. There
 are many cases where UTF8 is validated in **some** languages but not others, and
 there's also the C++ "hint" behavior that logs errors but allows invalid UTF8.

 **Note:** This could still be partially done in a follow-up LSC by targeting
 specific combinations of the new feature that disable validation in all relevant
 languages.

 #### Pros

 *   Punts on the issue: we wouldn't need any upb features, and C++ features
     would all be codegen-only
 *   Simplifies the situation and avoids adding a very complicated feature in
     edition zero

 #### Cons

 *   Not really possible given the current complexity
 *   There are O(10M) proto2 string fields that would be blindly changed to bytes

 ### Alternative 4: Nested Features

 Another option is to allow for shared feature set messages. For example, upb
 would define a feature message, but *not* make it an extension of the global
 `FeatureSet`. Instead, languages with upb implementations would have a field of
 this type to allow for finer-grained controls. C++ would both extend the global
 `FeatureSet` and also be allowed as a field in other languages.

 For example, python utf8 validation could be specified as:
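
 A minimal sketch of such a specification, assuming hypothetical feature
 messages: the `utf8_validation` field, the enum values, the `upb` field on the
 Python features, and the extension numbers are illustrative assumptions rather
 than real definitions, and feature metadata such as `targets` and
 `edition_defaults` is omitted.

 ```proto
 edition = "2023";

 package pb;

 import "google/protobuf/descriptor.proto";

 // Hypothetical enum; the value names are assumptions for this sketch.
 enum Utf8Validation {
   UTF8_VALIDATION_UNKNOWN = 0;
   NONE = 1;
   HINT = 2;
   VERIFY = 3;
 }

 // Defined by upb, but NOT an extension of the global FeatureSet.
 message UpbFeatures {
   Utf8Validation utf8_validation = 1;
 }

 // Defined by C++: extends FeatureSet and is also reused as a field below.
 message CppFeatures {
   Utf8Validation utf8_validation = 1;
 }

 // Python nests the shared messages, one field per backing implementation.
 message PythonFeatures {
   CppFeatures cpp = 1;
   UpbFeatures upb = 2;
 }

 // The extension numbers are placeholders.
 extend google.protobuf.FeatureSet {
   CppFeatures cpp = 1000;
   PythonFeatures python = 9995;
 }

 message Foo {
   string name = 1 [
     features.(pb.cpp).utf8_validation = HINT,
     // Kept identical to (pb.cpp) in this sketch.
     features.(pb.python).cpp.utf8_validation = HINT,
     features.(pb.python).upb.utf8_validation = VERIFY
   ];
 }
 ```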

 We could have checks during feature validation that enforce that impossible
 combinations aren't specified. For example, with our current implementation
 `features.(pb.python).cpp` should always be identical to `features.(pb.cpp)`,
 since we don't have any mechanism for distinguishing them.

 #### Pros

 *   Much more explicit than Alternatives 1 and 2

 #### Cons

 *   Maybe too explicit? Proto owners would be forced to duplicate a lot of
     features